Asymptotically Efficient Distributed Estimation 
With Exponential Family Statistics 

Soummya Kar and Jose M. F. Moura 
Abstract 

The paper studies the problem of distributed parameter estimation in multi-agent networks with exponential family 
observation statistics. A certainty-equivalence type distributed consensus + innovations estimator is proposed, which, 
under global observability of the networked sensing model and mean connectivity of the inter-agent communication 
network, is shown to yield consistent parameter estimates at each network agent. Further, it is shown that the distributed 
estimator is asymptotically efficient, in that, the asymptotic covariances of the agent estimates coincide with that of the 
optimal centralized estimator, i.e., the inverse of the centralized Fisher information rate. From a technical viewpoint, 
the proposed distributed estimator leads to non-Markovian mixed time-scale stochastic recursions and the analytical 
methods developed in the paper contribute to the general theory of distributed stochastic approximation. 

Index Terms 

Distributed estimation, exponential family, consistency, asymptotic efficiency, stochastic approximation. 

1. Introduction 

A. Motivation 

Motivated by applications in multi-agent networked information processing, we revisit the problem of distributed 
sequential parameter estimation. The setup considered is a highly non-classical distributed information setting, 
in which each network agent samples over time an independent and identically distributed (i.i.d.) time-series 
with exponential family statistics parameterized by the (vector) parameter of interest. Further, in the spirit of 
typical agent-networking and wireless sensing applications with limited agent communication and computation 
capabilities, we restrict ourselves to scenarios in which each agent is only aware of its local observation statistics 
and, assuming slotted-discrete time, may only communicate (collaborate) with its agent-neighborhood (possibly 
dynamic and random) once per epoch of new observation acquisition, i.e., we consider scenarios in which the 
inter-agent communication rate is at most as high as the observation sampling rate. Broadly speaking, the goal 
of distributed parameter estimation in such multi-agent scenarios is to update over time the local agent estimates 
by effectively processing local observation samples and exchanging information with neighboring agents. To this 
end, the paper presents a distributed estimation approach of the consensus + innovations type, which, among other 
things, accomplishes the following: 

Consistency under distributed observability: Under global observability of the multi-agent sensing model and 
mean connectivity of the inter-agent communication-collaboration network, our distributed estimation approach is 
shown to yield strongly consistent parameter estimates at each agent. Conversely, it may be readily seen that the 
conditions of global observability and mean network connectivity are in fact necessary for obtaining consistent 
parameter estimates in our distributed information-collaboration setup. Indeed, global observability is the minimal 
requirement for consistency even in centralized estimation, whereas, in the absence of network connectivity, 
there may be locally unobservable agent-network components which, under no circumstance, will be able to 
generate consistent parameter estimates. Interestingly, the above leads to the following characterization of distributed 
observability: distributed observability, i.e., the minimal structural conditions on the sensing and communication 
models such that there exists a distributed estimation scheme leading to consistent parameter estimates at each 
network agent, is equivalent to global observability and mean network connectivity. 

Asymptotic efficiency: Under the same conditions of distributed observability, the proposed distributed estimation 
approach is shown to be asymptotically efficient. In other words, in terms of asymptotic convergence rate, the local 
agent estimates are as good as the optimal centralized estimates, i.e., they all achieve asymptotic covariance equal to the 
inverse of the centralized Fisher information rate. The key point to note here is that the above optimality holds as 
long as the mean communication network is connected irrespective of how sparse the link realizations are. 

Conforming to the sensing and communication architecture, our distributed estimation approach is of the consensus 
+ innovations type, in which at every observation sampling epoch the local agent estimate refinement step embeds 
a single round of local neighborhood estimate mixing, the consensus or agreement potential [1]-[5], with local 
processing of the sampled new observation, the innovation potential. Multi-agent stochastic recursive algorithms 
of the above type have been proposed in prior work; see, for example, early work [1], [5]-[7] on parallel 
stochastic gradient and stochastic approximation; consensus + innovations approaches for nonlinear distributed 
estimation [8], detection [9]-[11], adaptive control [12] and learning [13]; diffusion approaches for network inference 
and optimization [14], [15]; networked LMS and variants [14], [16]-[19]. The key distinction between the above 
and the current work is that, in the former the focus has been mainly on consistency (or minimizing the asymptotic 
error residual between the estimated and the true parameter), but not on asymptotic efficiency. The requirement 
of asymptotic efficiency complicates the construction of such distributed algorithms non-trivially and necessitates 
the use of time-varying consensus and innovation gains in the update process; further these time-varying gains 
driving the persistent consensus and innovation potentials need to decay at strictly different rates in order for the 
distributed scheme to achieve the asymptotic covariance of the optimal centralized estimator. 

Footnote: Global observability means that for every pair of different parameter values, the corresponding probability measures induced on the aggregate 
or collective agent observation set are distinguishable. For setups involving exponential families, distinguishability is aptly captured by strict 
positivity of the Kullback-Leibler (KL) divergence between the corresponding measures; see Assumption 2.2 for details. 

Such mixed time-scale 
construction for asymptotically efficient distributed parameter estimation in linear statistical models was obtained 
in [20], [21]. However, in contrast to optimal estimation in linear statistical models [20], [21], in the nonlinear 
non-Gaussian setting, the local innovation gains that achieve asymptotic efficiency are necessarily dependent on 
the true value of the parameter to be estimated, and on the statistics of the global sensing model. Since the 
value of the parameter (and hence the optimal estimator gains) is not available in advance, our proposed 
distributed estimation approach involves a distributed online gain learning procedure that proceeds in conjunction 
with the sequential estimation task. As a result, a closed-loop interaction occurs between the gain learning and 
parameter estimation which is reminiscent of the certainty-equivalence approach for adaptive estimation and control 
- although the analysis methodology is significantly different from classical techniques used in adaptive processing 
(see, for example, [22], [23] and the references therein, in the context of parameter estimation), primarily due to 
the distributed nature of our problem. Specifically, in our approach, each agent runs simultaneously three local 
time recursions: (1) an auxiliary distributed consensus + innovations estimator driven by non-adaptive innovation 
gains; (2) an online distributed learning procedure that uses the auxiliary distributed local estimators to generate a 
sequence of optimal adaptive innovations gains; and (3) the desired distributed consensus + innovations estimator 
whose innovations are weighted by the optimal adaptive innovations gains, thus achieving asymptotic efficiency. We 
note in this context that the idea of recovering asymptotically efficient estimates from consistent (but suboptimal) 
auxiliary estimates, although novel from a distributed estimation standpoint, has been investigated in prior work on 
(centralized) recursive estimation, see, for example, [24], [25]. 

In summary, in contrast to existing work, the current paper presents a principled development of distributed 
parameter estimation as applicable to the general and important class of multi-agent statistical exponential families; 
paralleling the classical development of centralized parameter estimation, it quantifies notions of distributed 
observability, performance metrics, information measures and algorithmic optimality. Due to the mixed time-scale 
behavior and the non-Markovianity (induced by the learning process), the stochastic procedure does not fall under 
the purview of standard stochastic approximation (see, for example, [26]) or distributed stochastic approximation 
(see, for example, Q, E-i), CD, HU, (StI-EII) procedures. In fact, some of the intermediate results on the 
pathwise convergence rates of mixed time-scale stochastic procedures obtained in the paper are more broadly 
applicable and contribute to the general theory of distributed stochastic approximation. In this context, we note 
the study of mixed time-scale stochastic procedures that arise in algorithms of the simulated annealing type (see, 
for example, [30]). Apart from being distributed, our scheme technically differs from [30] in that, whereas the 
additive perturbation in [30] is a martingale difference sequence, ours is a network dependent consensus potential 
manifesting past dependence. In fact, intuitively, a key step in the analysis is to derive pathwise strong approximation 
results to characterize the rate at which the consensus term/process converges to a martingale difference process. 
We also emphasize that our notion of mixed time-scale is different from that of stochastic algorithms with coupling 
(see [31], [32]), where a quickly switching parameter influences the relatively slower dynamics of another state, 
leading to averaged dynamics. Mixed time-scale procedures of this latter type arise in multi-scale distributed 
information diffusion problems; see, in particular, the paper [33], which studies interactive consensus formation in 
Markov-modulated switching networks. 

We comment briefly on the organization of the rest of the paper. Section 1-B sets up notation. The multi-agent 
sensing model is formalized in Section 2-A, whereas preliminary facts pertaining to the model and assumptions are 
summarized in Section 2-B. Section 3-A describes the distributed estimation approach, and the main results of the 
paper (concerning consistency and asymptotic efficiency of the proposed approach) are stated in Section 3-B. The 
major technical developments are presented in Section 4, culminating in the proofs of the main results in Section 5. 
Finally, Section 6 concludes the paper. 

B. Notation 

We denote by $\mathbb{R}$ the set of reals, by $\mathbb{R}_+$ the set of non-negative reals, and by $\mathbb{R}^k$ the $k$-dimensional Euclidean space. For 
$a, b \in \mathbb{R}$, we use $a \vee b$ and $a \wedge b$ to denote the maximum and minimum of $a$ and $b$, respectively. For deterministic 
$\mathbb{R}_+$-valued sequences $\{a_t\}$ and $\{b_t\}$, the notation $a_t = O(b_t)$ denotes the existence of a constant $c > 0$ such that 
$a_t \le c\, b_t$ for all $t$ sufficiently large. Further, the notation $a_t = o(b_t)$ is used to indicate that $a_t / b_t \to 0$ as $t \to \infty$. 
For $\mathbb{R}_+$-valued stochastic processes $\{a_t\}$ and $\{b_t\}$, the corresponding order notations are to be interpreted to hold 
pathwise almost surely (a.s.). 

The set of $k \times k$ real matrices is denoted by $\mathbb{R}^{k \times k}$. The corresponding subspace of symmetric matrices is denoted 
by $\mathbb{S}^k$. The cone of positive semidefinite matrices is denoted by $\mathbb{S}^k_{+}$, whereas $\mathbb{S}^k_{++}$ denotes the subset of positive 
definite matrices. The $k \times k$ identity matrix is denoted by $I_k$, while $\mathbf{1}_k$ and $\mathbf{0}_k$ denote respectively the column 
vectors of ones and zeros in $\mathbb{R}^k$. Often the symbol $0$ is used to denote the $k \times p$ zero matrix, the dimensions being 
clear from the context. The symbol $\top$ denotes matrix transpose, whereas, for a finite set of matrices $A_n \in \mathbb{R}^{k_n \times p}$, 
$n = 1, \cdots, N$, the quantity $\mathrm{Vec}(A_n)$ denotes the $(k_1 + \cdots + k_N) \times p$ matrix $[A_1^\top, \cdots, A_N^\top]^\top$ obtained as the 
(column-wise) stack of the matrices $A_n$. The operator $\|\cdot\|$ applied to a vector denotes the standard Euclidean $\ell_2$ 
norm, while applied to matrices it denotes the induced $\ell_2$ norm, which is equivalent to the matrix spectral radius 
for symmetric matrices. Also, for $a \in \mathbb{R}^k$ and $\varepsilon > 0$, we will use $\mathbb{B}_\varepsilon(a)$ to denote the closed $\varepsilon$-neighborhood of $a$, 
i.e., 

$$\mathbb{B}_\varepsilon(a) = \left\{ b \in \mathbb{R}^k : \|b - a\| \le \varepsilon \right\}.$$

The notation $A \otimes B$ is used to denote the Kronecker product of two matrices $A$ and $B$. 
The following notion of consensus subspace and its complement will be used: 

Definition 1.1. Let $N$ and $M$ be positive integers and consider the Euclidean space $\mathbb{R}^{NM}$. The consensus or 
agreement subspace $\mathcal{C}$ of $\mathbb{R}^{NM}$ is then defined as 

$$\mathcal{C} = \left\{ z \in \mathbb{R}^{NM} : z = \mathbf{1}_N \otimes a \text{ for some } a \in \mathbb{R}^M \right\}.$$

The orthogonal complement of $\mathcal{C}$ in $\mathbb{R}^{NM}$ is denoted by $\mathcal{C}^\perp$. Finally, for a given vector $z \in \mathbb{R}^{NM}$, its projection on 
the consensus subspace $\mathcal{C}$ is to be denoted by $z_{\mathcal{C}}$, whereas $z_{\mathcal{C}^\perp} = z - z_{\mathcal{C}}$ denotes the projection on the orthogonal 
complement $\mathcal{C}^\perp$. 

Also, for $z \in \mathcal{C}$, we will denote by $z^a$ the vector $a \in \mathbb{R}^M$ such that $z = \mathbf{1}_N \otimes a$. 
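As a small illustration of Definition 1.1, the following sketch computes the consensus projection of a stacked vector by block averaging, which is what the orthogonal projection onto the span of vectors of the form $\mathbf{1}_N \otimes a$ reduces to. The helper name `consensus_projection` and the numerical values are illustrative assumptions, not part of the original development.

```python
def consensus_projection(z, N, M):
    """Split z in R^{NM} into its consensus part and the orthogonal residual.

    z is read as N stacked blocks of length M (agent 1 first). Projecting onto
    C = {1_N (x) a} replaces every block by the block average a.
    """
    blocks = [z[n * M:(n + 1) * M] for n in range(N)]
    a = [sum(block[m] for block in blocks) / N for m in range(M)]
    z_c = [a[m] for _ in range(N) for m in range(M)]  # 1_N (x) a
    z_perp = [zi - ci for zi, ci in zip(z, z_c)]      # z - z_C
    return a, z_c, z_perp


# Three agents (N = 3) holding two-dimensional vectors (M = 2).
z = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
a, z_c, z_perp = consensus_projection(z, N=3, M=2)
```

Here `a` plays the role of the common vector carried by the consensus component, and `z_perp` measures the disagreement among agents; the two components are orthogonal by construction.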

Time is assumed to be discrete or slotted throughout the paper. The symbols $t$ and $s$ denote time, and $\mathbb{T}_+$ is 
the discrete index set $\{0, 1, 2, \cdots\}$. The parameter to be estimated belongs to a subset $\Theta$ (generally open) of the 
Euclidean space $\mathbb{R}^M$. We reserve the symbol $\theta$ to denote a canonical element of the parameter space $\Theta$, whereas 
the true (but unknown) value of the parameter (to be estimated) is denoted by $\theta^*$. The symbol $x_n(t)$ is used to 
denote the $\mathbb{R}^M$-valued estimate of $\theta^*$ at time $t$ at agent $n$. Without loss of generality, the initial estimate $x_n(0)$ 
at time $0$ at agent $n$ is assumed to be a non-random quantity. 

Spectral graph theory: The inter-agent communication topology at a given time instant may be described by 
an undirected graph $G = (V, E)$, with $V = [1 \cdots N]$ and $E$ denoting the set of agents (nodes) and inter-agent 
communication links (edges), respectively. The unordered pair $(n, l) \in E$ if there exists an edge between nodes $n$ 
and $l$. We consider simple graphs, i.e., graphs devoid of self-loops and multiple edges. A graph is connected if 
there exists a path between each pair of nodes. The neighborhood of node $n$ is 

$$\Omega_n = \left\{ l \in V \mid (n, l) \in E \right\}.$$

Node $n$ has degree $d_n = |\Omega_n|$ (the number of edges with $n$ as one end point). The structure of the graph can be 
described by the symmetric $N \times N$ adjacency matrix $A = [A_{nl}]$, with $A_{nl} = 1$ if $(n, l) \in E$ and $A_{nl} = 0$ otherwise. Let the 
degree matrix be the diagonal matrix $D = \mathrm{diag}(d_1 \cdots d_N)$. By definition, the positive semidefinite matrix $L = D - A$ 
is called the graph Laplacian matrix. The eigenvalues of $L$ can be ordered as $0 = \lambda_1(L) \le \lambda_2(L) \le \cdots \le \lambda_N(L)$, 
the eigenvector corresponding to $\lambda_1(L)$ being $(1/\sqrt{N})\,\mathbf{1}_N$. The multiplicity of the zero eigenvalue equals the number 
of connected components of the network; for a connected graph, $\lambda_2(L) > 0$. This second eigenvalue is the algebraic 
connectivity or the Fiedler value of the network; see [34] for detailed treatment of graphs and their spectral theory. 
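The graph quantities above can be made concrete with a short sketch. It builds the Laplacian $L = D - A$ from an edge list, counts connected components (which equals the multiplicity of the zero eigenvalue), and evaluates the quadratic form $x^\top L x$, which equals the sum of squared differences across edges. Nodes are indexed from 0 for convenience, and the function names are illustrative assumptions.

```python
def laplacian(num_nodes, edges):
    """Return L = D - A for an undirected simple graph on nodes 0..num_nodes-1."""
    L = [[0] * num_nodes for _ in range(num_nodes)]
    for n, l in edges:
        L[n][l] -= 1
        L[l][n] -= 1
        L[n][n] += 1
        L[l][l] += 1
    return L


def count_components(num_nodes, edges):
    """Count connected components by depth-first search."""
    neighbors = {n: set() for n in range(num_nodes)}
    for n, l in edges:
        neighbors[n].add(l)
        neighbors[l].add(n)
    seen, count = set(), 0
    for start in range(num_nodes):
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbors[node] - seen)
    return count


def quadratic_form(L, x):
    """x^T L x, which equals the sum over edges of (x_n - x_l)^2."""
    size = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(size) for j in range(size))


path_edges = [(0, 1), (1, 2)]       # 3-node path graph
L_path = laplacian(3, path_edges)
```

For this 3-node path the Laplacian eigenvalues are 0, 1 and 3, so the Fiedler value is 1; removing the edge (1, 2) leaves two components and hence a zero eigenvalue of multiplicity two.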

2. Multi-agent sensing model 

Let $\theta^* \in \mathbb{R}^M$ be an $M$-dimensional (vector) parameter that is to be estimated by a network of $N$ agents. 
Throughout, we assume that all the random objects are defined on a common measurable space $(\Omega, \mathcal{F})$ equipped 
with a filtration $\{\mathcal{F}_t\}$. Probability and expectation, when the true (but unknown) parameter value $\theta^*$ is in force, 
are denoted by $\mathbb{P}_{\theta^*}(\cdot)$ and $\mathbb{E}_{\theta^*}[\cdot]$, respectively. All inequalities involving random variables are to be interpreted a.s. 

Footnote: A path between nodes $n$ and $l$ of length $m$ is a sequence $(n = i_0, i_1, \cdots, i_m = l)$ of vertices, such that $(i_k, i_{k+1}) \in E$ for all $0 \le k \le m - 1$. 

A. Sensing Model 

Each network agent $n$ sequentially observes an independent and identically distributed (i.i.d.) time-series $\{y_n(t)\}$ 
of noisy measurements of $\theta^*$, where the distribution $\mu_n^{\theta^*}$ of $y_n(t)$ belongs to a $\Theta$-parameterized exponential family, 
formalized as follows: 

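Assumption 2.1. For each $n$, let $\nu_n$ be a $\sigma$-finite measure on $\mathbb{R}^{M_n}$. Let $g_n : \mathbb{R}^{M_n} \mapsto \mathbb{R}^M$ be a Borel function 
such that for all $\theta \in \mathbb{R}^M$ the following expectation exists: 

$$\lambda_n(\theta) = \int e^{\theta^\top g_n(y_n)}\, d\nu_n(y_n) < \infty. \tag{1}$$

Finally, let $\{\mu_n^\theta\}$, for $\theta \in \mathbb{R}^M$, be the corresponding $\Theta$-parameterized exponential family of distributions on $\mathbb{R}^{M_n}$, 
i.e., for each $\theta \in \mathbb{R}^M$ the probability measure $\mu_n^\theta$ on $\mathbb{R}^{M_n}$ is given by the Radon-Nikodym derivative 

$$\frac{d\mu_n^\theta}{d\nu_n}(y_n) = \exp\left(\theta^\top g_n(y_n) - \psi_n(\theta)\right)$$

for all $y_n \in \mathbb{R}^{M_n}$, where $\psi_n(\cdot)$ denotes the function $\psi_n(\theta) = \log \lambda_n(\theta)$. 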

We assume that each network agent $n$ obtains an $\{\mathcal{F}_{t+1}\}$-adapted independent and identically distributed (i.i.d.) 
sequence $\{y_n(t)\}$ of observations of the (true) parameter $\theta^*$ with distribution $\mu_n^{\theta^*}$, and, for each $t$, $y_n(t)$ is 
independent of $\mathcal{F}_t$. Further, we assume that the observation sequences $\{y_n(t)\}$ and $\{y_l(t)\}$ at any two agents $n$ 
and $l$ are mutually independent. 
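To make Assumption 2.1 concrete, here is a minimal sketch for one familiar member of the class: a scalar Poisson model written in natural-parameter form. The choices below are assumptions of the example, not taken from the paper: the base measure puts mass $1/y!$ on each non-negative integer $y$, the statistic is $g(y) = y$, and $\theta$ is the log-rate, so that $\lambda(\theta) = \exp(e^\theta)$ and $\psi(\theta) = e^\theta$.

```python
import math


def log_partition(theta, terms=60):
    """psi(theta) = log lambda(theta), lambda(theta) = sum_y exp(theta*y)/y! (truncated)."""
    return math.log(sum(math.exp(theta * y) / math.factorial(y) for y in range(terms)))


def density(y, theta):
    """Radon-Nikodym derivative exp(theta*g(y) - psi(theta)) with g(y) = y."""
    return math.exp(theta * y - log_partition(theta))


def mean_statistic(theta, terms=60):
    """E[g(y)] under mu^theta; the base measure nu puts mass 1/y! on y."""
    return sum(y * density(y, theta) / math.factorial(y) for y in range(terms))


theta = 0.5
psi = log_partition(theta)
total_mass = sum(density(y, theta) / math.factorial(y) for y in range(60))
```

The truncated series reproduces the closed form $\psi(\theta) = e^\theta$, the density integrates to one against the base measure, and the mean of the statistic equals $e^\theta$, the derivative of $\psi$.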

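We will also denote by $y_t$ the totality of agent observations at a given time $t$, i.e., $y_t = \mathrm{Vec}(y_n(t)) = 
[y_1^\top(t), \cdots, y_N^\top(t)]^\top$. For $\theta \in \mathbb{R}^M$, let $\mu^\theta$ denote the product measure $\mu_1^\theta \otimes \cdots \otimes \mu_N^\theta$ on the product space 
$\mathbb{R}^{M_1} \otimes \cdots \otimes \mathbb{R}^{M_N}$; it is readily seen that $\{\mu^\theta\}$ is a $\Theta$-parameterized exponential family with respect to (w.r.t.) the 
product measure $\nu = \nu_1 \otimes \cdots \otimes \nu_N$ and given by the Radon-Nikodym derivatives 

$$\frac{d\mu^\theta}{d\nu}(y) = \exp\left(\theta^\top g(y) - \psi(\theta)\right),$$

where $y = \mathrm{Vec}(y_n)$ denotes a generic element of the product space and the functions $g(\cdot)$ and $\psi(\cdot)$ are given by 

$$g(y) = \sum_{n=1}^{N} g_n(y_n) \quad \text{and} \quad \psi(\theta) = \sum_{n=1}^{N} \psi_n(\theta), \tag{2}$$

respectively. 

It is readily seen that under Assumption 2.1 the global observation sequence $\{y_t\}$ is $\{\mathcal{F}_{t+1}\}$-adapted, with $y_t$ 
being independent of $\mathcal{F}_t$ and distributed as $\mu^{\theta^*}$ for all $t$. 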

For most practical agent network applications, each agent observes only a subset of $M_n$ of the components of 
the parameter vector, with $M_n \ll M$. It is then necessary for the agents to collaborate by means of occasional local 
inter-agent message exchanges to achieve a reasonable estimate of the parameter $\theta^*$. To formalize, while we do not 
require local observability for $\theta^*$, we assume that the network sensing model is globally observable as follows: 

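Assumption 2.2. The network sensing model is globally observable, i.e., we assume $\mathcal{K}(\theta, \theta') > 0$ and $\mathcal{K}(\theta', \theta) > 0$ 
for each pair $(\theta, \theta')$ of distinct parameter values, where $\mathcal{K}(\theta, \theta')$ denotes the Kullback-Leibler divergence between the 
distributions $\mu^\theta$ and $\mu^{\theta'}$, i.e., 

$$\mathcal{K}(\theta, \theta') = \int \log\left(\frac{d\mu^\theta}{d\mu^{\theta'}}(y)\right) d\mu^\theta(y).$$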
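The gap between local and global observability can be seen in a toy linear Gaussian network, which is an assumption of this sketch rather than an example from the paper. Agent $n$ observes $y_n \sim \mathcal{N}(H_n \theta, I)$, for which the Kullback-Leibler divergence between parameter values $\theta$ and $\theta'$ is $\tfrac{1}{2}\|H_n(\theta - \theta')\|^2$; since the agents are independent, the global divergence is the sum of the local ones.

```python
def kl_gaussian(H, theta, theta_prime):
    """KL divergence between N(H theta, I) and N(H theta', I)."""
    diff = [a - b for a, b in zip(theta, theta_prime)]
    Hd = [sum(h * d for h, d in zip(row, diff)) for row in H]
    return 0.5 * sum(v * v for v in Hd)


# Two agents, two-dimensional parameter; agent n sees only component n.
H = {1: [[1.0, 0.0]], 2: [[0.0, 1.0]]}
theta, theta_prime = [0.0, 0.0], [0.0, 1.0]

local_kl = {n: kl_gaussian(H[n], theta, theta_prime) for n in H}
global_kl = sum(local_kl.values())  # product measure: divergences add up
```

Agent 1 cannot distinguish the two parameter values (its local divergence is zero), yet the network as a whole can, which is the situation that global observability is meant to capture.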

B. Some preliminaries 

We state some useful analytical properties associated with the multi-agent sensing model, in particular, the 
implications of the global observability condition (see Assumption 2.2). Most of the listed properties are direct 
consequences of standard analytical arguments involving statistical exponential families, see, for example, [35]. 



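Proposition 2.1. Let Assumption 2.1 hold. Then, 

(1) For each $n$, the function $\psi_n(\cdot)$ is infinitely differentiable on $\mathbb{R}^M$. 

(2) For each $n$, let $h_n : \mathbb{R}^M \mapsto \mathbb{R}^M$ be the gradient of $\psi_n(\cdot)$, i.e., $h_n(\theta) = \nabla_\theta \psi_n(\theta)$ for all $\theta \in \mathbb{R}^M$. Then, 

$$h_n(\theta) = \int g_n(y_n)\, d\mu_n^\theta(y_n) \quad \forall\, \theta \in \mathbb{R}^M \tag{3}$$

and the following inequality (monotonicity) holds for each pair $(\theta, \theta')$ in $\mathbb{R}^M$: 

$$(\theta - \theta')^\top \left(h_n(\theta) - h_n(\theta')\right) \ge 0. \tag{4}$$

(3) If, in addition, Assumption 2.2 holds, denoting by $h(\cdot)$ the gradient of $\psi(\cdot)$, see (2), we have the following 
strict monotonicity: 

$$(\theta - \theta')^\top \left(h(\theta) - h(\theta')\right) = \sum_{n=1}^{N} (\theta - \theta')^\top \left(h_n(\theta) - h_n(\theta')\right) > 0$$

for each pair $(\theta, \theta')$ in $\mathbb{R}^M$ such that $\theta \ne \theta'$. 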

Proof: The first assertion is an immediate consequence of the fact that the function $\psi_n(\theta)$ associated with the 
exponential family $\{\mu_n^\theta\}$ is infinitely differentiable on the interior of the natural parameter space (the set on which 
the expectation in (1) exists), see Theorem 2.2 in [35]. The second assertion constitutes a well-known property of 
statistical exponential families (see Corollary 2.5 in [35]). The same corollary in [35] asserts that the inequality 
in (4) is strict whenever the measures $\mu^\theta$ and $\mu^{\theta'}$ are different for $\theta \ne \theta'$, the latter being ensured by the positivity 
of the Kullback-Leibler divergences as in Assumption 2.2. ∎ 

The next proposition characterizes the information matrices (or Fisher matrices) associated with the sensing model 
and may be stated as follows (see [35] for a proof): 



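Footnote: For a function $f(\theta)$, $\nabla_\theta f(\theta) \in \mathbb{R}^M$ denotes the vector of partial derivatives, i.e., the $i$-th component of $\nabla_\theta f(\theta)$ is given by $\partial f(\theta) / \partial \theta_i$. The 
Hessian $\nabla^2_\theta f(\theta) \in \mathbb{R}^{M \times M}$ denotes the matrix of second order partial derivatives, whose $(i, j)$-th entry corresponds to $\partial^2 f(\theta) / \partial \theta_i \partial \theta_j$. 

Proposition 2.2. Let Assumption 2.1 hold. Then, 

(1) For each $n$ and $\theta \in \mathbb{R}^M$, let $I_n(\theta)$ denote the Fisher information matrix associated with the exponential family $\{\mu_n^\theta\}$, 

$$I_n(\theta) = -\int \left(\nabla^2_\theta \log \frac{d\mu_n^\theta}{d\nu_n}(y_n)\right) d\mu_n^\theta(y_n), \tag{5}$$

where the expectation integral is to be interpreted entry-wise. Then, $I_n(\theta)$ is positive semidefinite and satisfies 
$I_n(\theta) = \nabla_\theta \left(h_n(\theta)\right)$ for all $\theta$, with $h_n(\cdot)$ denoting the function in (3). 

(2) If, in addition, Assumption 2.2 holds, the global Fisher information matrix $I(\theta)$, given by 

$$I(\theta) = -\int \left(\nabla^2_\theta \log \frac{d\mu^\theta}{d\nu}(y)\right) d\mu^\theta(y), \tag{6}$$

is positive definite and satisfies 

$$I(\theta) = \sum_{n=1}^{N} I_n(\theta) = \sum_{n=1}^{N} \nabla_\theta h_n(\theta)$$

for all $\theta \in \mathbb{R}^M$. 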
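The additivity of the Fisher information across agents can be checked numerically in a toy linear Gaussian model, assumed here purely for illustration: agent $n$ observes $y_n \sim \mathcal{N}(H_n \theta, I)$, so that $h_n(\theta) = H_n^\top H_n \theta$ and $I_n(\theta) = H_n^\top H_n$ for every $\theta$. Each local matrix below is singular, while their sum is positive definite.

```python
def fisher(H):
    """I_n = H_n^T H_n for y_n ~ N(H_n theta, I); independent of theta in this model."""
    M = len(H[0])
    return [[sum(row[i] * row[j] for row in H) for j in range(M)] for i in range(M)]


def det2(A):
    """Determinant of a 2 x 2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


def inv2(A):
    """Inverse of a 2 x 2 matrix with nonzero determinant."""
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]


I_1 = fisher([[1.0, 0.0]])   # agent 1 sees theta_1 only
I_2 = fisher([[1.0, 1.0]])   # agent 2 sees theta_1 + theta_2 only
I_global = [[I_1[i][j] + I_2[i][j] for j in range(2)] for i in range(2)]
efficient_cov = inv2(I_global)   # the inverse global Fisher information
```

Neither agent alone could estimate both components, since `det2(I_1)` and `det2(I_2)` vanish, but the global matrix is invertible and its inverse is the benchmark covariance that an efficient estimator attains.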



For the multi-agent statistical exponential families under consideration, the well-known Cramér-Rao characterization 
holds, and it may be shown that the mean-squared estimation error of any (centralized) estimator based on 
$t$ sets of observation samples from all the agents is lower bounded by the quantity $t^{-1} I^{-1}(\theta^*)$, where $\theta^*$ denotes 
the true value of the parameter. Making $t$ tend to $\infty$, the class of asymptotically efficient (optimal) estimators is 
defined as follows: 

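Definition 2.1. An asymptotically efficient estimator of $\theta^*$ is an $\{\mathcal{F}_t\}$-adapted sequence $\{\hat{\theta}_t\}$, such that $\{\hat{\theta}_t\}$ is 
asymptotically normal with asymptotic covariance $I^{-1}(\theta^*)$, i.e., 

$$\sqrt{t + 1}\,\left(\hat{\theta}_t - \theta^*\right) \Longrightarrow \mathcal{N}\left(0, I^{-1}(\theta^*)\right),$$

where $\Longrightarrow$ and $\mathcal{N}(\cdot, \cdot)$ denote convergence in distribution and the normal distribution, respectively. 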

Remark 2.1. Centralized estimators that are asymptotically efficient for the proposed multi-agent setting may be 
obtained using now-standard results in point estimation theory. For instance, the (centralized) maximum likelihood 
estimator is known to achieve asymptotic efficiency; specifically, there exists an $\{\mathcal{F}_t\}$-adapted sequence $\{\hat{\theta}_t\}$, such 
that 

$$\hat{\theta}_t \in \operatorname{argmax}_{\theta \in \mathbb{R}^M} \left( \sum_{s=0}^{t-1} \log \frac{d\mu^\theta}{d\nu}(y_s) \right) \quad \text{a.s. for all } t,$$

and $\{\hat{\theta}_t\}$ is asymptotically normal with asymptotic covariance $I^{-1}(\theta^*)$. Note that, apart from being centralized, 
the maximum likelihood estimator, as implemented above, consists of a batch-form realization. To cope with 
this, extensive research has focused on the development of time-sequential (but centralized) estimators based on 
recursively processing the agents' observation data $y_t$; asymptotically efficient recursive centralized estimators of 
the stochastic approximation type have been developed by several authors, see, for example, [24], [25], [36]-[38]. 

Footnote: The term centralized estimator refers to a hypothetical fusion center based estimator that has access to all agent observations at all times. 
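As an illustration of the batch-form maximum likelihood estimator in Remark 2.1, the sketch below treats a scalar Poisson model in natural-parameter form, an assumption of the example: $g(y) = y$ and $\psi(\theta) = e^\theta$, so each sample contributes $\theta y - e^\theta$ to the log-likelihood up to a term that does not depend on $\theta$. Setting the derivative to zero gives the closed-form maximizer, the logarithm of the sample mean.

```python
import math


def log_likelihood(theta, samples):
    """Sum of theta*y - psi(theta) over samples; theta-free terms are dropped."""
    return sum(theta * y - math.exp(theta) for y in samples)


def mle(samples):
    """Maximizer of the log-likelihood: exp(theta) must equal the sample mean.

    Requires a strictly positive sample mean.
    """
    return math.log(sum(samples) / len(samples))


samples = [2, 0, 3, 1, 4]            # illustrative pooled observations
theta_hat = mle(samples)
score = sum(y - math.exp(theta_hat) for y in samples)  # gradient at the maximizer
```

The estimator needs the whole batch at once, which is the limitation that motivates the recursive schemes discussed in the remark.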

We emphasize that the centralized (recursive or batch) estimators, as discussed above, are based on the availability 
of the entire set of agent observations at a centralized resource at all times, and further require the global model 
information (the statistics of the agent exponential families $\{\mu_n^\theta\}$ for all $n$) so that the nonlinear innovation gains 
driving the recursive estimators may be designed appropriately to achieve asymptotic efficiency. In contrast, the goal 
of this paper is to develop collaborative distributed asymptotically efficient estimators of $\theta^*$ at each agent $n$ of the 
network, in which (i) the information is distributed, i.e., at a given instant of time $t$ each agent $n$ has access to its 
local sensed data $y_n(t)$ only; (ii) to start with, each agent $n$ is aware of its local sensing model $\{\mu_n^\theta\}$ only; and 
(iii) the agents may only collaborate by exchanging information over a (sparse) pre-defined communication network, 
where inter-agent communication and observation sampling occur at the same rate, i.e., each agent $n$ may only 
exchange one round of messages with its designated communication neighbors per sampling epoch. To this end, the 
proposed estimators consist of simultaneous distributed local estimate update and distributed local gain refinement 
(learning) at each network agent $n$, with closed-loop interaction between the estimation and learning processes. 
From a technical point of view, in contrast to centralized stochastic approximation based estimators, the estimators 
developed in the paper are of the distributed nonlinear stochastic approximation type with necessarily mixed time-scale 
dynamics; the mixed time-scale dynamics arise as a result of suitably crafting the relative intensities of the 
potentials for local collaboration and local innovation, necessary for achieving asymptotic efficiency. Distributed 
estimators of mixed time-scale dynamics have been introduced and studied in [8], [20]; we refer to them as consensus 
+ innovations estimators. 

3. Asymptotically Efficient Distributed Estimator 

In this section, we provide distributed sequential estimators for $\theta^*$ that are not only consistent but asymptotically 
optimal, in that the local asymptotic covariances at each agent coincide with the inverse of the centralized 
Fisher information rate $I^{-1}(\theta^*)$ associated with the exponential observation statistics in consideration. Other than 
challenges encountered in the distributed implementation, a major difficulty in obtaining such asymptotically efficient 
distributed estimators concerns the design of the local estimator or innovation gains (to be made precise later); in 
particular, in contrast to optimal estimation in linear statistical models [20], [21], in the nonlinear non-Gaussian 
setting, the innovation gains that achieve asymptotic efficiency are necessarily dependent on the true value $\theta^*$ of the 
parameter to be estimated. Since the value of $\theta^*$ (and hence the optimal estimator gains) is not available in advance, 
we propose a distributed estimation approach that involves a distributed online gain learning procedure that proceeds 
in conjunction with the sequential estimation task. As a result, a closed-loop interaction occurs between 
the gain learning and parameter estimation that is reminiscent of the certainty equivalence approach to adaptive 
estimation and control, although the analysis methodology is significantly different from classical techniques used 
in adaptive processing, primarily due to the distributed nature of our problem and its mixed time-scale dynamics. 
Specifically, the main idea in the proposed distributed estimation methodology is to generate simultaneously two 
distributed estimators $\{\breve{x}_n(t)\}$ and $\{x_n(t)\}$ at each agent $n$; the former, the auxiliary estimate sequences $\{\breve{x}_n(t)\}$, 
are driven by constant (non-adaptive) innovation gains and, while consistent for $\theta^*$, are suboptimal 
in the sense of asymptotic covariance. The consistent auxiliary estimates are used to generate the sequence of 
optimal adaptive innovation gains through another online distributed learning procedure; the resulting adaptive gain 
process is in turn used to drive the evolution of the desired estimate sequences $\{x_n(t)\}$ at each agent $n$, which 
will be shown to be asymptotically efficient from the asymptotic covariance viewpoint. We emphasize that, as will 
be seen below, the construction of the auxiliary estimate sequences, the adaptive gain refining, and the 
generation of the optimal estimators are all executed simultaneously. 

A. Algorithms and Assumptions 

The proposed optimal distributed estimation methodology consists of the following three simultaneous update 
processes at each agent $n$: (i) auxiliary estimate sequence $\{\breve{x}_n(t)\}$ generation; (ii) adaptive gain refinement; and 
(iii) optimal estimate sequence $\{x_n(t)\}$ generation. Formally: 

Auxiliary Estimate Generation: Each agent $n$ maintains an $\{\mathcal{F}_t\}$-adapted $\mathbb{R}^M$-valued estimate sequence $\{\breve{x}_n(t)\}$ 
for $\theta^*$, recursively updated in a distributed fashion as follows: 

$$\breve{x}_n(t+1) = \breve{x}_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left(\breve{x}_n(t) - \breve{x}_l(t)\right) + \alpha_t \left(g_n(y_n(t)) - h_n(\breve{x}_n(t))\right), \tag{7}$$

where $\{\beta_t\}$ and $\{\alpha_t\}$ correspond to appropriate time-varying weighting factors for the agreement (consensus) and 
innovation (new observation) potentials, respectively, whereas $\Omega_n(t)$ denotes the $\{\mathcal{F}_{t+1}\}$-adapted time-varying 
random neighborhood of agent $n$ at time $t$. 
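One synchronous round of the auxiliary update (7) can be sketched as follows for a scalar parameter. The function and argument names are hypothetical, and the example model (Gaussian observations of the mean, for which $g_n(y) = y$ and $h_n(\theta) = \theta$) is an assumption chosen to keep the arithmetic transparent.

```python
def auxiliary_step(x, neighbors, y, g, h, alpha_t, beta_t):
    """One round of update (7) for all agents (scalar parameter).

    x[n]         current auxiliary estimate at agent n
    neighbors[n] indices of the agents that n hears from in this round
    y[n]         new observation at agent n
    g[n], h[n]   sufficient statistic and gradient of the log-partition function
    """
    x_next = []
    for n, x_n in enumerate(x):
        consensus = sum(x_n - x[l] for l in neighbors[n])   # agreement potential
        innovation = g[n](y[n]) - h[n](x_n)                 # innovation potential
        x_next.append(x_n - beta_t * consensus + alpha_t * innovation)
    return x_next


identity = lambda v: v
neighbors = {0: [1], 1: [0, 2], 2: [1]}     # 3-agent path graph
x_next = auxiliary_step(
    x=[0.0, 1.0, 2.0],
    neighbors=neighbors,
    y=[1.0, 1.0, 1.0],
    g=[identity] * 3,
    h=[identity] * 3,
    alpha_t=0.5,
    beta_t=0.25,
)
```

Each agent uses only its own observation and its neighbors' current estimates, in line with the one-message-round-per-sample restriction of the setup.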

Optimal Estimate Generation: In addition, each agent $n$ generates an optimal (or refined) estimate sequence 
$\{x_n(t)\}$, which is also $\{\mathcal{F}_t\}$-adapted and evolves as 

$$x_n(t+1) = x_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left(x_n(t) - x_l(t)\right) + \alpha_t K_n(t) \left(g_n(y_n(t)) - h_n(x_n(t))\right). \tag{8}$$

Note that the key difference between the estimate updates in (7) and (8) is in the use of adaptive (time-varying) 
gains $K_n(t)$ in the innovation part in the latter, as opposed to static gains in the former. Specifically, the adaptive 
gain sequence $\{K_n(t)\}$ at an agent $n$ is an $\{\mathcal{F}_t\}$-adapted $\mathbb{R}^{M \times M}$-valued process which is generated according to 
a distributed learning process as follows. 

Adaptive Gain Refinement: The $\{\mathcal{F}_t\}$-adapted gain sequence $\{K_n(t)\}$ at an agent $n$ is generated according to a 
distributed learning process, driven by the auxiliary estimates $\{\breve{x}_n(t)\}$ obtained in (7), as follows: 

$$K_n(t) = \left(G_n(t) + \varphi_t I_M\right)^{-1} \quad \forall\, n, \tag{9}$$

where $\{\varphi_t\}$ is a deterministic sequence of positive numbers such that $\varphi_t \to 0$ as $t \to \infty$, and each agent $n$ 
maintains another $\{\mathcal{F}_t\}$-adapted $\mathbb{S}^M_{+}$-valued process $\{G_n(t)\}$ evolving in a distributed fashion as 

$$G_n(t+1) = G_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left(G_n(t) - G_l(t)\right) + \alpha_t \left(I_n(\breve{x}_n(t)) - G_n(t)\right) \tag{10}$$

for all $t$, with some positive semidefinite initial condition $G_n(0)$ and $I_n(\cdot)$ denoting the local Fisher information 
matrix, see (5). 
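The gain learning step can be sketched for a scalar parameter as below. The recursion for $G_n(t)$ follows (10); the map from $G_n(t)$ to the gain is taken here to be the regularized inverse $K_n(t) = (G_n(t) + \varphi_t)^{-1}$, which is an assumption of this sketch, as are the function names and the constant local Fisher informations used in the example.

```python
def gain_state_step(G, x_aux, neighbors, fisher, alpha_t, beta_t):
    """One round of recursion (10) for all agents (scalar parameter)."""
    G_next = []
    for n, G_n in enumerate(G):
        consensus = sum(G_n - G[l] for l in neighbors[n])
        innovation = fisher[n](x_aux[n]) - G_n   # local Fisher info at the auxiliary estimate
        G_next.append(G_n - beta_t * consensus + alpha_t * innovation)
    return G_next


def gains(G, phi_t):
    """Assumed gain map: regularized inverse, well defined even when G_n = 0."""
    return [1.0 / (G_n + phi_t) for G_n in G]


neighbors = {0: [1], 1: [0, 2], 2: [1]}
# Illustrative local Fisher informations; agent 1 is locally uninformative.
fisher = [lambda th: 1.0, lambda th: 0.0, lambda th: 2.0]

G_next = gain_state_step([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], neighbors, fisher,
                         alpha_t=0.5, beta_t=0.25)
K_next = gains(G_next, phi_t=0.5)
```

The positive offset `phi_t` keeps the inverse bounded while the running information estimates are still poor, and its influence fades as it decays to zero.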

Before discussing further, we formalize the assumptions on the inter-agent stochastic communication and the 
algorithm weight sequences $\{\alpha_t\}$ and $\{\beta_t\}$ in the following: 

Assumption 3.1. The $\{\mathcal{F}_{t+1}\}$-adapted sequence $\{L_t\}$ of communication network Laplacians (modeling the agent 
communication neighborhoods $\Omega_n(t)$ at each time $t$) is temporally i.i.d. with $L_t$ being independent of $\mathcal{F}_t$ for each 
$t$. Further, the sequence $\{L_t\}$ is connected on the average, i.e., $\lambda_2(\bar{L}) > 0$, where $\bar{L} = \mathbb{E}_{\theta^*}[L_t]$ denotes the mean 
Laplacian. 
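Mean connectivity is weaker than connectivity of every realization. The sketch below assumes a simple link-failure model that is not taken from the paper: each link of a 3-node path is up independently with probability `p`. It enumerates all link realizations exactly, accumulates the mean Laplacian, and records how often a realization is itself connected.

```python
from itertools import product


def laplacian(num_nodes, edges):
    """Return L = D - A for an undirected simple graph on nodes 0..num_nodes-1."""
    L = [[0] * num_nodes for _ in range(num_nodes)]
    for n, l in edges:
        L[n][l] -= 1
        L[l][n] -= 1
        L[n][n] += 1
        L[l][l] += 1
    return L


def is_connected(num_nodes, edges):
    """Depth-first search from node 0; connected iff every node is reached."""
    neighbors = {n: set() for n in range(num_nodes)}
    for n, l in edges:
        neighbors[n].add(l)
        neighbors[l].add(n)
    seen, stack = set(), [0]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbors[node] - seen)
    return len(seen) == num_nodes


base_edges = [(0, 1), (1, 2)]
p = 0.5   # illustrative link-up probability

mean_L = [[0.0] * 3 for _ in range(3)]
connected_prob = 0.0
for up in product([0, 1], repeat=len(base_edges)):
    edges = [e for e, u in zip(base_edges, up) if u]
    prob = 1.0
    for u in up:
        prob *= p if u else 1 - p
    L = laplacian(3, edges)
    for i in range(3):
        for j in range(3):
            mean_L[i][j] += prob * L[i][j]
    if is_connected(3, edges):
        connected_prob += prob

# Support graph of the mean Laplacian: links with strictly negative mean weight.
mean_edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if mean_L[i][j] < 0]
```

In this model only one realization in four is connected, yet the mean Laplacian is `p` times the path Laplacian, whose second eigenvalue is `p`, so the network is connected on the average.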

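Assumption 3.2. The weight sequences $\{\beta_t\}$ and $\{\alpha_t\}$ satisfy 

$$\alpha_t = \frac{1}{t + 1}, \qquad \beta_t = \frac{b}{(t + 1)^{\tau_2}}, \tag{11}$$

where $b > 0$ and $0 < \tau_2 < 1/2$. 

Further, the sequence $\{\varphi_t\}$ in (9) satisfies 

$$\lim_{t \to \infty} (t + 1)^{\mu_2} \varphi_t = 0 \tag{12}$$

for some positive constant $\mu_2$. 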
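The mixed time-scale character of the weights can be illustrated with polynomially decaying sequences. The specific forms and constants below ($\alpha_t = 1/(t+1)$, $\beta_t = b/(t+1)^{\tau_2}$ with $b = 0.3$ and $\tau_2 = 0.4$, and $\varphi_t = (t+1)^{-\mu}$ with $\mu = 0.5$) are assumptions of this sketch, chosen to be compatible with the stated constraints $b > 0$ and $0 < \tau_2 < 1/2$.

```python
def alpha(t):
    """Innovation weight (assumed form)."""
    return 1.0 / (t + 1)


def beta(t, b=0.3, tau2=0.4):
    """Consensus weight (assumed form); decays more slowly than alpha."""
    return b / (t + 1) ** tau2


def phi(t, mu=0.5):
    """Regularization sequence (assumed form); (t+1)^mu2 * phi_t -> 0 for any mu2 < mu."""
    return 1.0 / (t + 1) ** mu


# The consensus potential dominates asymptotically: beta_t / alpha_t grows without bound.
ratio_early = beta(9) / alpha(9)
ratio_late = beta(999) / alpha(999)

# Decay check for the regularization with mu2 = 0.25.
scaled_early = (9 + 1) ** 0.25 * phi(9)
scaled_late = (999 + 1) ** 0.25 * phi(999)
```

Both weights vanish, but at different rates, so agreement among agents is enforced on a faster time scale than the incorporation of new observations.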



The following weak linear growth condition on the functions $h_n(\cdot)$ driving the (nonlinear) innovations in (7)-(8) 
will be assumed: 

Assumption 3.3. For each $\theta \in \mathbb{R}^M$, there exist positive constants $c_1^\theta$ and $c_2^\theta$, such that, for each $n$, the function $h_n(\cdot)$ 
in (3) satisfies the local linear growth condition, 

$$\left\| h_n(\theta') - h_n(\theta) \right\| \le c_1^\theta \left\| \theta' - \theta \right\| + c_2^\theta,$$

for all $\theta' \in \mathbb{R}^M$. 

B. Main Results 

We formally state the main results of the paper, the proofs appearing in Section 5.

Theorem 3.1. Let Assumptions 2.2, 3.1, 3.2, and 3.3 hold. Then, for each $n$, the estimate sequence $\{x_n(t)\}$ is strongly
consistent. In particular, we have

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\tau} \|x_n(t) - \theta^*\| = 0\right) = 1 \tag{13}$$

for each $n$ and $\tau \in [0, 1/2)$.



The consistency in Theorem 3.1 is order optimal in that (13) fails to hold with an exponent $\tau \geq 1/2$ for any
(including centralized) estimation procedure.

The next result concerns the asymptotic efficiency of the estimates generated by the proposed distributed scheme.

Theorem 3.2. Let Assumptions 2.2, 3.1, 3.2, and 3.3 hold. Then, for each $n$ we have

$$\sqrt{t+1}\,\left(x_n(t) - \theta^*\right) \Longrightarrow \mathcal{N}\left(0, I^{-1}(\theta^*)\right),$$

where $\mathcal{N}(\cdot, \cdot)$ and $\Longrightarrow$ denote the Gaussian distribution and weak convergence, respectively.

4. A Generic Consistent Distributed Estimator 

With a view to understanding the asymptotic behavior of the auxiliary estimate processes $\{\check{x}_n(t)\}$, $n = 1, \cdots, N$,
introduced in Section 3-A, see (7), we study a somewhat more general class of distributed estimate processes with
time-varying local innovation gains. Other than establishing consistency of these estimates (see Theorem 4.1),
we obtain pathwise convergence rate asymptotics of the estimate processes to $\theta^*$ (see Theorem 4.2). These
latter convergence rate results will be used to analyze the impact of the auxiliary estimates in the adaptive gain
computation (9)-(10).

Theorem 4.1. For each $n$, let $\{z_n(t)\}$ be an $\mathbb{R}^M$-valued $\{\mathcal{F}_t\}$-adapted process (estimator) evolving as follows:

$$z_n(t+1) = z_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left(z_n(t) - z_l(t)\right) + \alpha_t K_n(t) \left(g_n(y_n(t)) - h_n(z_n(t))\right). \tag{14}$$

Suppose Assumptions 2.1, 3.3, and 3.1 on the network system model hold, and the weight sequences $\{\beta_t\}$ and $\{\alpha_t\}$
satisfy Assumption 3.2. Additionally, let the matrix gain processes $\{K_n(t)\}$ be $\mathbb{S}_{++}^{M}$-valued and $\{\mathcal{F}_t\}$-adapted, and let there
exist a positive definite matrix $\mathcal{K}$ and a constant $\tau_3 > 0$ such that the gain processes $\{K_n(t)\}$ converge uniformly
to $\mathcal{K}$ at rate $\tau_3$, i.e., for each $\delta > 0$, there exists a deterministic time $t_\delta$ such that, for all $n$,

$$\mathbb{P}_{\theta^*}\left(\sup_{t \geq t_\delta} (t+1)^{\tau_3} \|K_n(t) - \mathcal{K}\| \leq \delta\right) = 1. \tag{15}$$

Then, for each $n$, $\{z_n(t)\}$ is a consistent estimator of $\theta^*$, i.e., $z_n(t) \to \theta^*$ as $t \to \infty$ a.s.
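
A small simulation can help build intuition for recursion (14). The sketch below is a scalar toy instance under assumptions that are not taken from the paper: a fixed path graph, Gaussian observations $y_n(t) \sim \mathcal{N}(c_n \theta^*, 1)$ (for which $g_n(y) = c_n y$ and $h_n(\theta) = c_n^2 \theta$), and a constant innovation gain, which satisfies the uniform convergence hypothesis (15) trivially. Agent 1 has $c_1 = 0$ and is therefore locally unobservable; it can only learn through consensus.

```python
import random


def simulate_estimator(theta_star=2.0, steps=20000, b=0.3, tau2=0.4, seed=0):
    """Scalar instance of recursion (14) with a constant innovation gain."""
    rng = random.Random(seed)
    neighbors = {0: [1], 1: [0, 2], 2: [1]}      # hypothetical fixed path graph
    c = [1.0, 0.0, 1.0]                           # assumed model: y_n(t) ~ N(c_n * theta, 1)
    gain = len(c) / sum(c_n ** 2 for c_n in c)    # constant gain standing in for K_n(t)
    z = [0.0, 0.0, 0.0]
    for t in range(steps):
        alpha = 1.0 / (t + 1)
        beta = b / (t + 1) ** tau2
        new_z = []
        for n, z_n in enumerate(z):
            y = c[n] * theta_star + rng.gauss(0.0, 1.0)
            g = c[n] * y                          # sufficient statistic g_n(y_n(t))
            h = c[n] ** 2 * z_n                   # its mean h_n at the current estimate
            consensus = sum(z_n - z[l] for l in neighbors[n])
            new_z.append(z_n - beta * consensus + alpha * gain * (g - h))
        z = new_z
    return z


estimates = simulate_estimator()
```

With these choices all three estimates, including the one at the uninformed agent, end up close to the assumed true value; the run illustrates the statement of the theorem but is of course not a substitute for its proof.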

The proof of Theorem 4.1 is accomplished in steps, the key intermediate ingredients being Lemma 4.1 and
Proposition 4.1, concerning the boundedness of the processes $\{z_n(t)\}$, $n = 1, \cdots, N$, and a Lyapunov-type
construction, respectively.

Lemma 4.1. Let the hypotheses of Theorem 4.1 hold. Then, for each $n$, the process $\{z_n(t)\}$ is bounded a.s., i.e.,

$$\mathbb{P}_{\theta^*}\left(\sup_{t \geq 0} \|z_n(t)\| < \infty\right) = 1.$$

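Proof: Let $\hat{z}_n(t) = z_n(t) - \theta^*$ and denote by $z_t$, $\hat{z}_t$ and $\boldsymbol{\theta}^*$ the $\mathbb{R}^{NM}$-valued $\mathrm{Vec}(z_n(t))$, $\mathrm{Vec}(\hat{z}_n(t))$, and
$\mathbf{1}_N \otimes \theta^*$, respectively. Noting that $(L_t \otimes I_M)(\mathbf{1}_N \otimes \theta^*) = 0$, the process $\{\hat{z}_t\}$ is seen to satisfy

$$\hat{z}_{t+1} = \hat{z}_t - \beta_t (L_t \otimes I_M)\hat{z}_t - \alpha_t K_t\left(h(z_t) - h(\boldsymbol{\theta}^*)\right) + \alpha_t K_t\left(g(y_t) - h(\boldsymbol{\theta}^*)\right), \tag{16}$$

where

$$h(z_t) = \mathrm{Vec}\left(h_n(z_n(t))\right), \quad h(\boldsymbol{\theta}^*) = \mathrm{Vec}\left(h_n(\theta^*)\right), \quad g(y_t) = \mathrm{Vec}\left(g_n(y_n(t))\right) \tag{17}$$

and $K_t = \mathrm{Diag}(K_n(t))$. Note that, by hypothesis, $K_t \in \mathbb{S}_{++}^{NM}$, and define the $\mathbb{R}_+$-valued $\{\mathcal{F}_t\}$-adapted process
$\{V_t\}$ by

$$V_t = \hat{z}_t^{\top} \left(I_N \otimes \mathcal{K}^{-1}\right) \hat{z}_t \tag{18}$$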

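for all $t$. Note that by (16) we obtain

$$\begin{aligned}
(I_N \otimes \mathcal{K})^{-1} \hat{z}_{t+1} = {} & (I_N \otimes \mathcal{K}^{-1}) \hat{z}_t - \beta_t (L_t \otimes \mathcal{K}^{-1}) \hat{z}_t \\
& - \alpha_t (I_N \otimes \mathcal{K})^{-1} K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) + \alpha_t (I_N \otimes \mathcal{K})^{-1} K_t \left(g(y_t) - h(\boldsymbol{\theta}^*)\right).
\end{aligned}$$

We have for all $t \geq 0$

$$\mathbb{E}_{\theta^*}\left[g(y_t) - h(\boldsymbol{\theta}^*)\right] = 0,$$

and using the temporal independence of the Laplacian sequence we obtain

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] = {} & V_t - 2\beta_t \hat{z}_t^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) \hat{z}_t - 2\alpha_t \hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \\
& + \beta_t^2 \hat{z}_t^{\top} \mathbb{E}_{\theta^*}\left[(L_t \otimes I_M)(I_N \otimes \mathcal{K}^{-1})(L_t \otimes I_M)\right] \hat{z}_t \\
& + 2\alpha_t \beta_t \hat{z}_t^{\top} (\overline{L} \otimes I_M)(I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \\
& + \alpha_t^2 \left(h(z_t) - h(\boldsymbol{\theta}^*)\right)^{\top} K_t (I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \\
& + \alpha_t^2 \mathbb{E}_{\theta^*}\left[\left(g(y_t) - h(\boldsymbol{\theta}^*)\right)^{\top} K_t (I_N \otimes \mathcal{K}^{-1}) K_t \left(g(y_t) - h(\boldsymbol{\theta}^*)\right)\right]
\end{aligned} \tag{19}$$

for all $t \geq 0$.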

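Recall the definition of consensus subspace in Definition 1.1 and note that by using the properties of the Laplacian
$\overline{L}$ and matrix Kronecker products we have

$$\hat{z}_t^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) \hat{z}_t = (z_t)_{\mathcal{C}^\perp}^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) (z_t)_{\mathcal{C}^\perp} \geq \lambda_2(\overline{L}) \lambda_1(\mathcal{K}^{-1}) \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2 \tag{20}$$

for all $t \geq 0$, where $\lambda_1(\mathcal{K}^{-1}) > 0$ denotes the smallest eigenvalue of the positive definite matrix $\mathcal{K}^{-1}$.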
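Now consider the inequality

$$\hat{z}_t^{\top} \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) = \sum_{n=1}^{N} \left(z_n(t) - \theta^*\right)^{\top} \left(h_n(z_n(t)) - h_n(\theta^*)\right) \geq 0$$

(where the non-negativity of the terms in the summation follows from Proposition 2.1), and note that, by Assumption 3.3
and hypothesis (15), there exist positive constants $c_1$ and $t_1$ large enough such that

$$\begin{aligned}
\hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right)
& \geq \hat{z}_t^{\top} \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) - \left|\hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) \left(K_t - (I_N \otimes \mathcal{K})\right) \left(h(z_t) - h(\boldsymbol{\theta}^*)\right)\right| \\
& \geq -\|\hat{z}_t\| \left\|I_N \otimes \mathcal{K}^{-1}\right\| \left\|K_t - (I_N \otimes \mathcal{K})\right\| \left\|h(z_t) - h(\boldsymbol{\theta}^*)\right\| \\
& \geq -c_1 \frac{1}{(t+1)^{\tau_3}} \left(1 + \|\hat{z}_t\|^2\right)
\end{aligned}$$

for all $t \geq t_1$, where we also use the inequality $\|\hat{z}_t\| \leq \|\hat{z}_t\|^2 + 1$. Similarly, by invoking the boundedness of
the matrices involved and the linear growth condition on the $h_n(\cdot)$'s, and making $c_1$ and $t_1$ larger if necessary, we
obtain the following sequence of inequalities for all $t \geq t_1$:

$$\begin{aligned}
\hat{z}_t^{\top} \mathbb{E}_{\theta^*}\left[(L_t \otimes I_M)(I_N \otimes \mathcal{K}^{-1})(L_t \otimes I_M)\right] \hat{z}_t
& = (z_t)_{\mathcal{C}^\perp}^{\top} \mathbb{E}_{\theta^*}\left[(L_t \otimes I_M)(I_N \otimes \mathcal{K}^{-1})(L_t \otimes I_M)\right] (z_t)_{\mathcal{C}^\perp} \leq c_1 \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2, \\
\hat{z}_t^{\top} (\overline{L} \otimes I_M)(I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) & \leq c_1 \left(1 + \|\hat{z}_t\|^2\right), \\
\left(h(z_t) - h(\boldsymbol{\theta}^*)\right)^{\top} K_t (I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) & \leq c_1 \left(1 + \|\hat{z}_t\|^2\right),
\end{aligned}$$

and

$$\mathbb{E}_{\theta^*}\left[\left(g(y_t) - h(\boldsymbol{\theta}^*)\right)^{\top} K_t (I_N \otimes \mathcal{K}^{-1}) K_t \left(g(y_t) - h(\boldsymbol{\theta}^*)\right)\right] \leq c_1, \tag{21}$$

where the last inequality uses the fact that $g(y_t)$ possesses moments of all orders due to the exponential statistics.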
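Noting that there exist positive constants $c_2$ and $c_3$ such that

$$c_2 \|\hat{z}_t\|^2 \leq \hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) \hat{z}_t = V_t \leq c_3 \|\hat{z}_t\|^2 \tag{22}$$

for all $t$, by (19)-(21) we have for all $t \geq t_1$

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \leq {} & \left(1 + c_4 \alpha_t \left(\frac{1}{(t+1)^{\tau_3}} + \beta_t + \alpha_t\right)\right) V_t - c_5 \left(\beta_t - \beta_t^2\right) \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2 \\
& + c_6 \left(\frac{\alpha_t}{(t+1)^{\tau_3}} + \alpha_t \beta_t + \alpha_t^2\right)
\end{aligned} \tag{23}$$

for some positive constants $c_4$, $c_5$ and $c_6$. Since $\beta_t \to 0$ as $t \to \infty$ by (11), we may choose $t_2$ large enough (larger
than $t_1$) such that $(\beta_t - \beta_t^2) \geq 0$ for all $t \geq t_2$. Further, the hypotheses on the weight sequences (11) confirm the
existence of constants $\tau_4$ and $\tau_5$ strictly greater than 1, and positive constants $c_7$ and $c_8$, such that

$$c_4 \alpha_t \left(\frac{1}{(t+1)^{\tau_3}} + \beta_t + \alpha_t\right) \leq \frac{c_7}{(t+1)^{\tau_4}} = \gamma_t$$

and

$$c_6 \left(\frac{\alpha_t}{(t+1)^{\tau_3}} + \alpha_t \beta_t + \alpha_t^2\right) \leq \frac{c_8}{(t+1)^{\tau_5}} = \gamma'_t$$

for all $t \geq t_2$ (by making $t_2$ larger if necessary). By the above construction we then obtain

$$\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \leq (1 + \gamma_t) V_t + \gamma'_t \tag{24}$$

for all $t \geq t_2$ with the positive weight sequences $\{\gamma_t\}$ and $\{\gamma'_t\}$ being summable, i.e.,

$$\sum_{t \geq 0} \gamma_t < \infty \quad \text{and} \quad \sum_{t \geq 0} \gamma'_t < \infty. \tag{25}$$

Note that, by (25), the product $\prod_{s=t}^{\infty} (1 + \gamma_s)$ exists for all $t$, and define by $\{W_t\}$ the $\mathbb{R}_+$-valued $\{\mathcal{F}_t\}$-adapted
process such that

$$W_t = \left(\prod_{s=t}^{\infty} (1 + \gamma_s)\right) V_t + \sum_{s=t}^{\infty} \gamma'_s, \quad \forall t. \tag{26}$$

By (24), the process $\{W_t\}$ may be shown to satisfy

$$\mathbb{E}_{\theta^*}\left[W_{t+1} \mid \mathcal{F}_t\right] \leq W_t$$

for all $t \geq t_2$. Being a non-negative supermartingale, the process $\{W_t\}$ converges a.s. to a finite random variable
$W^*$ as $t \to \infty$. It then follows readily by (26) that $V_t \to W^*$ a.s. as $t \to \infty$. In particular, we conclude that the
process $\{V_t\}$ is bounded a.s., which establishes the desired boundedness of the sequences $\{z_n(t)\}$ for all $n$. ■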
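
The role of summability in the boundedness argument can be seen on a deterministic analogue of (24), in which the inequality is taken with equality and the conditional expectation is dropped. The sketch below iterates $v_{t+1} = (1 + \gamma_t) v_t + \gamma'_t$ and compares the result with the elementary bound $\exp\left(\sum_s \gamma_s\right)\left(v_0 + \sum_s \gamma'_s\right)$, which follows from $1 + x \leq e^x$. The specific weights and the initial value are assumptions chosen only so that both sequences are summable.

```python
import math


def recursion_bound_check(v0=5.0, horizon=100000):
    """Iterate v_{t+1} = (1 + gamma_t) v_t + gamma'_t with summable weights."""
    v = v0
    sum_gamma = 0.0
    sum_gamma_prime = 0.0
    for t in range(horizon):
        gamma = 1.0 / (t + 1) ** 1.5        # assumed summable weight (exponent > 1)
        gamma_prime = 2.0 / (t + 1) ** 1.2  # assumed summable weight (exponent > 1)
        v = (1.0 + gamma) * v + gamma_prime
        sum_gamma += gamma
        sum_gamma_prime += gamma_prime
    # prod(1 + gamma_s) <= exp(sum gamma_s), so the iterate cannot exceed this value.
    bound = math.exp(sum_gamma) * (v0 + sum_gamma_prime)
    return v, bound


v_final, v_bound = recursion_bound_check()
```

Because both partial sums stay bounded as the horizon grows, the bound, and hence the iterate, stays bounded uniformly in the horizon; with exponents at or below 1 the same recursion would diverge.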

The following useful convergence may be extracted as a corollary to Lemma 4.1.

Corollary 4.1. Under the hypotheses of Lemma 4.1, there exists a finite random variable $V^*$ such that $V_t \to V^*$
a.s. as $t \to \infty$, where $V_t = \hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) \hat{z}_t$ as in (18).

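The following Lyapunov-type construction, whose proof is relegated to Appendix A, will be critical to the
subsequent development.

Proposition 4.1. Let $\varepsilon \in (0, 1)$ and denote by $\Gamma_\varepsilon$ the set

$$\Gamma_\varepsilon = \left\{ z \in \mathbb{R}^{NM} : \varepsilon \leq \|z - \boldsymbol{\theta}^*\| \leq 1/\varepsilon \right\}. \tag{27}$$

For each $t \geq 0$, denote by $\mathcal{H}_t : \mathbb{R}^{NM} \to \mathbb{R}$ the function given by

$$\mathcal{H}_t(z) = \frac{\beta_t}{\alpha_t} b_\beta \left(z - \boldsymbol{\theta}^*\right)^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) \left(z - \boldsymbol{\theta}^*\right) + \left(z - \boldsymbol{\theta}^*\right)^{\top} \left(h(z) - h(\boldsymbol{\theta}^*)\right) \tag{28}$$

for all $z \in \mathbb{R}^{NM}$, where the matrix $\mathcal{K}^{-1} \in \mathbb{S}_{++}^{M}$ and $b_\beta > 0$ is a constant. Then, there exist $t_\varepsilon > 0$ and a constant
$b_\varepsilon > 0$ such that for all $t \geq t_\varepsilon$

$$\mathcal{H}_t(z) \geq b_\varepsilon \|z - \boldsymbol{\theta}^*\|^2, \quad \forall z \in \Gamma_\varepsilon.$$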

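We now complete the proof of Theorem 4.1.

Proof of Theorem 4.1: In what follows we use the notation and definitions formulated in the proof of
Lemma 4.1. Let us consider $\varepsilon \in (0, 1)$ and let $\rho_\varepsilon$ denote the $\{\mathcal{F}_t\}$ stopping time

$$\rho_\varepsilon = \inf\left\{t \geq 0 : z_t \notin \Gamma_\varepsilon\right\},$$

where $\Gamma_\varepsilon$ is defined in (27). Let $\{V_t\}$ be the $\{\mathcal{F}_t\}$-adapted process defined in (18) and denote by $\{V_t^\varepsilon\}$ the stopped
process

$$V_t^\varepsilon = V_{t \wedge \rho_\varepsilon}, \quad \forall t,$$

which is readily seen to be $\{\mathcal{F}_t\}$ adapted. Noting that

$$V_{t+1}^\varepsilon = V_{t+1} \mathbb{I}(\rho_\varepsilon > t) + V_{\rho_\varepsilon} \mathbb{I}(\rho_\varepsilon \leq t)$$

and the fact that the indicator function $\mathbb{I}(\rho_\varepsilon > t)$ and the random variable $V_{\rho_\varepsilon} \mathbb{I}(\rho_\varepsilon \leq t)$ are adapted to $\mathcal{F}_t$ for all
$t$ ($\rho_\varepsilon$ being an $\{\mathcal{F}_t\}$ stopping time), we have

$$\mathbb{E}_{\theta^*}\left[V_{t+1}^\varepsilon \mid \mathcal{F}_t\right] = \mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \mathbb{I}(\rho_\varepsilon > t) + V_{\rho_\varepsilon} \mathbb{I}(\rho_\varepsilon \leq t) \tag{29}$$

for all $t$.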

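Recall the function $\mathcal{H}_t(\cdot)$ defined in (28): setting $b_\beta = 1/2$ in the definition of $\mathcal{H}_t(\cdot)$ we obtain

$$\begin{aligned}
2\beta_t \hat{z}_t^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) \hat{z}_t & + 2\alpha_t \hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \\
= {} & 2\alpha_t \mathcal{H}_t(z_t) + \beta_t \hat{z}_t^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) \hat{z}_t \\
& + 2\alpha_t \hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) \left(K_t - (I_N \otimes \mathcal{K})\right) \left(h(z_t) - h(\boldsymbol{\theta}^*)\right).
\end{aligned}$$

A slight rearrangement of the terms in the expansion (19) then yields

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] = {} & V_t - 2\alpha_t \mathcal{H}_t(z_t) - \beta_t \hat{z}_t^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) \hat{z}_t \\
& - 2\alpha_t \hat{z}_t^{\top} (I_N \otimes \mathcal{K}^{-1}) \left(K_t - (I_N \otimes \mathcal{K})\right) \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \\
& + \beta_t^2 \hat{z}_t^{\top} \mathbb{E}_{\theta^*}\left[(L_t \otimes I_M)(I_N \otimes \mathcal{K}^{-1})(L_t \otimes I_M)\right] \hat{z}_t \\
& + 2\alpha_t \beta_t \hat{z}_t^{\top} (\overline{L} \otimes I_M)(I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \\
& + \alpha_t^2 \left(h(z_t) - h(\boldsymbol{\theta}^*)\right)^{\top} K_t (I_N \otimes \mathcal{K}^{-1}) K_t \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \\
& + \alpha_t^2 \mathbb{E}_{\theta^*}\left[\left(g(y_t) - h(\boldsymbol{\theta}^*)\right)^{\top} K_t (I_N \otimes \mathcal{K}^{-1}) K_t \left(g(y_t) - h(\boldsymbol{\theta}^*)\right)\right]
\end{aligned} \tag{30}$$

for all $t \geq 0$, where $\mathcal{H}_t(\cdot)$ is defined in (28). The inequalities in (20)-(22) then show that there exist positive
constants $b_1$, $b_2$ and $b_3$, and a deterministic time $t_1$ (large enough), such that

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \leq {} & \left(1 + b_1 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right)\right) V_t - 2\alpha_t \mathcal{H}_t(z_t) \\
& - b_2 \left(\beta_t - \beta_t^2\right) \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2 + b_3 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right)
\end{aligned} \tag{31}$$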

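for all $t \geq t_1$. Note that, by definition, on the event $\{\rho_\varepsilon > t\}$ we have $z_t \in \Gamma_\varepsilon$, and hence, an immediate application
of Proposition 4.1 establishes the existence of a positive constant $b_\varepsilon$ and a large enough deterministic time $t_\varepsilon > 0$,
such that

$$\mathcal{H}_t(z_t) \mathbb{I}(\rho_\varepsilon > t) \geq b_\varepsilon \|\hat{z}_t\|^2 \mathbb{I}(\rho_\varepsilon > t)$$

for all $t \geq t_\varepsilon$. By (22) and (30)-(31), and making $t_\varepsilon$ larger if necessary, it then follows that there exists a constant
$b_4(\varepsilon) > 0$ such that

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \mathbb{I}(\rho_\varepsilon > t) \leq {} & \Big[\left(1 - b_4(\varepsilon) \alpha_t + b_1 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right)\right) V_t \\
& - b_2 \left(\beta_t - \beta_t^2\right) \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2 + b_3 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right)\Big] \mathbb{I}(\rho_\varepsilon > t)
\end{aligned}$$

for all $t \geq t_\varepsilon$. Since $\alpha_t \to 0$ and $\beta_t \to 0$ as $t \to \infty$, by choosing $t_\varepsilon$ large enough we may assert

$$b_4(\varepsilon) \alpha_t - b_1 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right) \geq \left(b_4(\varepsilon)/2\right) \alpha_t, \quad \forall t \geq t_\varepsilon,$$

and the existence of positive constants $b_5$ and $\tau_4$ such that

$$b_3 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right) \leq b_5 \alpha_t (t+1)^{-\tau_4}, \quad \forall t \geq t_\varepsilon.$$

We thus obtain for $t \geq t_\varepsilon$

$$\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \mathbb{I}(\rho_\varepsilon > t) \leq \left[\left(1 - \left(b_4(\varepsilon)/2\right) \alpha_t\right) V_t + b_5 \alpha_t (t+1)^{-\tau_4}\right] \mathbb{I}(\rho_\varepsilon > t). \tag{32}$$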
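Note that, by definition of $\Gamma_\varepsilon$,

$$\|\hat{z}_t\|^2 \geq \varepsilon^2 \quad \text{on } \{z_t \in \Gamma_\varepsilon\},$$

and, hence, by (22) we conclude that there exists a constant $b_6(\varepsilon) > 0$ such that

$$V_t \geq b_6(\varepsilon) \quad \text{on } \{\rho_\varepsilon > t\}.$$

By (32) we then have for all $t \geq t_\varepsilon$

$$\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \mathbb{I}(\rho_\varepsilon > t) \leq \left[V_t - b_7(\varepsilon) \alpha_t + b_5 \alpha_t (t+1)^{-\tau_4}\right] \mathbb{I}(\rho_\varepsilon > t)$$

with $b_7(\varepsilon)$ being another positive constant. Finally, the observation that $\left(b_7(\varepsilon)/2\right) \alpha_t \geq b_5 \alpha_t (t+1)^{-\tau_4}$ eventually
leads to

$$\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \mathbb{I}(\rho_\varepsilon > t) \leq \left[V_t - \left(b_7(\varepsilon)/2\right) \alpha_t\right] \mathbb{I}(\rho_\varepsilon > t) = V_t \mathbb{I}(\rho_\varepsilon > t) - b_8(\varepsilon) \alpha_t \mathbb{I}(\rho_\varepsilon > t)$$

for all $t \geq t_\varepsilon$ (making $t_\varepsilon$ larger if necessary), where $b_8(\varepsilon) = b_7(\varepsilon)/2$.
By (29) we then obtain

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1}^\varepsilon \mid \mathcal{F}_t\right] & \leq V_t \mathbb{I}(\rho_\varepsilon > t) + V_{\rho_\varepsilon} \mathbb{I}(\rho_\varepsilon \leq t) - b_8(\varepsilon) \alpha_t \mathbb{I}(\rho_\varepsilon > t) \\
& = V_t^\varepsilon - b_8(\varepsilon) \alpha_t \mathbb{I}(\rho_\varepsilon > t)
\end{aligned} \tag{33}$$

for all $t \geq t_\varepsilon$. Note that the $\{\mathcal{F}_t\}$-adapted process $\{V_t^\varepsilon\}_{t \geq t_\varepsilon}$ satisfies $\mathbb{E}_{\theta^*}[V_{t+1}^\varepsilon \mid \mathcal{F}_t] \leq V_t^\varepsilon$ for all $t \geq t_\varepsilon$; hence, being
a (non-negative) supermartingale it converges, i.e., there exists a finite random variable $V^*$ such that $V_t^\varepsilon \to V^*$ a.s.
as $t \to \infty$. Now consider the $\{\mathcal{F}_t\}$-adapted $\mathbb{R}_+$-valued process $\{W_t^\varepsilon\}$ given by

$$W_t^\varepsilon = V_t^\varepsilon + b_8(\varepsilon) \sum_{s=0}^{t-1} \alpha_s \mathbb{I}(\rho_\varepsilon > s), \tag{34}$$

and note that, by (33) we obtain

$$\mathbb{E}_{\theta^*}\left[W_{t+1}^\varepsilon \mid \mathcal{F}_t\right] \leq V_t^\varepsilon - b_8(\varepsilon) \alpha_t \mathbb{I}(\rho_\varepsilon > t) + b_8(\varepsilon) \sum_{s=0}^{t} \alpha_s \mathbb{I}(\rho_\varepsilon > s) = W_t^\varepsilon$$

for all $t \geq t_\varepsilon$; hence $\{W_t^\varepsilon\}_{t \geq t_\varepsilon}$ is a non-negative supermartingale and there exists a finite random variable $W^*$
such that $W_t^\varepsilon \to W^*$ a.s. as $t \to \infty$. We then conclude by (34) that the following limit exists:

$$\lim_{t \to \infty} b_8(\varepsilon) \sum_{s=0}^{t-1} \alpha_s \mathbb{I}(\rho_\varepsilon > s) = W^* - V^* < \infty \quad \text{a.s.} \tag{35}$$

Given that $\sum_{s=0}^{t} \alpha_s \to \infty$ as $t \to \infty$, the limit condition in (35) is fulfilled only if the summation terminates at a
finite time a.s., i.e., we must have $\rho_\varepsilon < \infty$ a.s.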

To summarize, we have for each $\varepsilon \in (0, 1)$, $\rho_\varepsilon < \infty$ a.s., i.e., the process $\{z_t\}$ exits the set $\Gamma_\varepsilon$ in finite time
a.s. In particular, for each positive integer $r > 1$, let $\rho_{1/r}$ be the stopping time obtained by choosing $\varepsilon = 1/r$
and consider the sequence $\{z_{\rho_{1/r}}\}$ (which is well defined due to the a.s. finiteness of each $\rho_{1/r}$) and note that, by
definition,

$$\left\|\hat{z}_{\rho_{1/r}}\right\| \in [0, 1/r) \cup (r, \infty) \quad \text{a.s.} \tag{36}$$

However, the a.s. boundedness of the sequence $\{z_t\}$ (see Lemma 4.1) implies that

$$\mathbb{P}_{\theta^*}\left(\left\|\hat{z}_{\rho_{1/r}}\right\| > r \ \text{i.o.}\right) = 0,$$

where i.o. stands for infinitely often as $r \to \infty$. Hence, by (36) we conclude that there exists a finite
integer-valued random variable $r^*$ such that $\|\hat{z}_{\rho_{1/r}}\| < 1/r$ for all $r \geq r^*$. This, in turn, implies that $\|\hat{z}_{\rho_{1/r}}\| \to 0$
as $r \to \infty$ a.s., and, in particular, we obtain

$$\mathbb{P}_{\theta^*}\left(\liminf_{t \to \infty} \|\hat{z}_t\| = 0\right) = 1.$$

By (22) we may also conclude that $\liminf_{t \to \infty} V_t = 0$ a.s. Noting that the limit of $\{V_t\}$ exists a.s. (see Corollary 4.1),
we further obtain $V_t \to 0$ as $t \to \infty$ a.s., from which, by another application of (22), we conclude that $\hat{z}_t \to 0$ as
$t \to \infty$ a.s. and the desired consistency assertion follows. ■
The other major result of this section concerns the pathwise convergence rate of the processes $\{z_n(t)\}$ to $\theta^*$,
stated as follows:

Theorem 4.2. Let the processes $\{z_n(t)\}$ be defined as in (14) and the assumptions and hypotheses of Theorem 4.1
hold. Then, there exists a constant $\mu > 0$ such that for all $n$ we have

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\mu} \|z_n(t) - \theta^*\| = 0\right) = 1.$$

In order to obtain Theorem 4.2 we will first quantify the rate of agreement among the individual agent estimates.
Specifically, we have the following (see Appendix A for a proof):

Lemma 4.2. Let the hypotheses of Lemma 4.1 hold. Then, for each pair of agents $n$ and $l$, we have

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\tau} \|z_n(t) - z_l(t)\| = 0\right) = 1$$

for all $\tau \in (0, 1 - \tau_2)$.

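We now complete the proof of Theorem 4.2.

Proof of Theorem 4.2: In what follows we reuse the notation and intermediate processes constructed in the
proofs of Lemma 4.1 and Theorem 4.1. Recall $\{V_t\}$ to be the $\{\mathcal{F}_t\}$-adapted process defined in (18). By (31) (and
the development preceding it) we note that there exist positive constants $b_1$, $b_2$, and $b_3$, and a deterministic time $t_1$
(large enough), such that

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right] \leq {} & \left(1 + b_1 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right)\right) V_t - 2\alpha_t \mathcal{H}_t(z_t) \\
& - b_2 \left(\beta_t - \beta_t^2\right) \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2 + b_3 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right)
\end{aligned} \tag{37}$$

for all $t \geq t_1$, where the function $\mathcal{H}_t(\cdot)$ is defined in (28). By (54) we obtain

$$\begin{aligned}
\mathcal{H}_t(z) & \geq \left(z_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h(z_{\mathcal{C}}) - h(\boldsymbol{\theta}^*)\right) + \left(z_{\mathcal{C}^\perp}\right)^{\top} \left(h(z) - h(\boldsymbol{\theta}^*)\right) + \left(z_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h(z) - h(z_{\mathcal{C}})\right) \\
& \geq \left(z_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h(z_{\mathcal{C}}) - h(\boldsymbol{\theta}^*)\right) - \left|\left(z_{\mathcal{C}^\perp}\right)^{\top} \left(h(z) - h(\boldsymbol{\theta}^*)\right)\right| - \left|\left(z_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h(z) - h(z_{\mathcal{C}})\right)\right|
\end{aligned} \tag{38}$$

for all $z \in \mathbb{R}^{NM}$.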

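Note that, by Proposition 2.1, $h(\cdot)$ is continuously differentiable with positive definite gradient $\nabla_\theta h(\theta^*) = I(\theta^*)$
at $\theta^*$; hence, by the mean-value theorem, there exists $\varepsilon_0 > 0$ such that for all $\theta \in \mathbb{B}_{\varepsilon_0}(\theta^*)$ we have

$$h(\theta) - h(\theta^*) = \left(I(\theta^*) + R(\theta, \theta^*)\right) \left(\theta - \theta^*\right), \tag{39}$$

where $R(\cdot, \theta^*)$ is a measurable $\mathbb{R}^{M \times M}$-valued function of $\theta$ such that

$$\|R(\theta, \theta^*)\| \leq \frac{\lambda_1(I(\theta^*))}{2}, \quad \forall \theta \in \mathbb{B}_{\varepsilon_0}(\theta^*), \tag{40}$$

with $\lambda_1(I(\theta^*)) > 0$ denoting the smallest eigenvalue of $I(\theta^*)$. Also, observing that the function $h(\cdot)$ is locally
Lipschitz, we may conclude that there exists a constant $\ell_{\varepsilon_0}$ such that

$$\|h(z) - h(z')\| \leq \ell_{\varepsilon_0} \|z - z'\|, \quad \forall z, z' \in \mathbb{B}_{\varepsilon_0}(\boldsymbol{\theta}^*). \tag{41}$$

Now note that, by Theorem 4.1, $z_t \to \boldsymbol{\theta}^*$ a.s. as $t \to \infty$, and, by Lemma 4.2, there exists a constant $\tau > 0$ such that

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\tau} \left\|(z_t)_{\mathcal{C}^\perp}\right\| = 0\right) = 1.$$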

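Now consider $\delta > 0$ (arbitrarily small) and note that by Egorov's theorem there exists a (deterministic) time $t_\delta > 0$
(chosen to be larger than $t_1$ in (37)), such that $\mathbb{P}_{\theta^*}(A_\delta) \geq 1 - \delta$, where $A_\delta$ denotes the event

$$A_\delta = \left\{\sup_{t \geq t_\delta} \|z_t - \boldsymbol{\theta}^*\| \leq \varepsilon_0\right\} \cap \left\{\sup_{t \geq t_\delta} (t+1)^{\tau} \left\|(z_t)_{\mathcal{C}^\perp}\right\| \leq \varepsilon_0\right\}.$$

Consequently, denoting by $\rho_\delta$ the $\{\mathcal{F}_t\}$ stopping time

$$\rho_\delta = \inf\left\{t \geq t_\delta : \|z_t - \boldsymbol{\theta}^*\| > \varepsilon_0 \ \text{or} \ (t+1)^{\tau} \left\|(z_t)_{\mathcal{C}^\perp}\right\| > \varepsilon_0\right\},$$

we have that

$$\mathbb{P}_{\theta^*}\left(\rho_\delta = \infty\right) \geq 1 - \delta. \tag{42}$$

Now consider $t \in [t_\delta, \rho_\delta)$; noting that

$$\|z_t^a - \theta^*\| \leq \|z_t - \boldsymbol{\theta}^*\| \leq \varepsilon_0,$$

we have by the construction in (39)-(40)

$$\begin{aligned}
\left((z_t)_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h((z_t)_{\mathcal{C}}) - h(\boldsymbol{\theta}^*)\right) & = \left(z_t^a - \theta^*\right)^{\top} \left(h(z_t^a) - h(\theta^*)\right) \\
& \geq \left(\lambda_1(I(\theta^*)) - \|R(z_t^a, \theta^*)\|\right) \|z_t^a - \theta^*\|^2 \geq (1/2) \lambda_1(I(\theta^*)) \|z_t^a - \theta^*\|^2 \\
& \geq b_4 V_t
\end{aligned}$$

for some constant $b_4 > 0$.

Similarly, using (41), we have the following inequalities for $t \in [t_\delta, \rho_\delta)$:

$$\left(z_t\right)_{\mathcal{C}^\perp}^{\top} \left(h(z_t) - h(\boldsymbol{\theta}^*)\right) \leq \left\|(z_t)_{\mathcal{C}^\perp}\right\| \left\|h(z_t) - h(\boldsymbol{\theta}^*)\right\| \leq \varepsilon_0^2 \ell_{\varepsilon_0} (t+1)^{-\tau} \tag{43}$$

and

$$\left((z_t)_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h(z_t) - h((z_t)_{\mathcal{C}})\right) \leq \left\|(z_t)_{\mathcal{C}} - \boldsymbol{\theta}^*\right\| \ell_{\varepsilon_0} \left\|z_t - (z_t)_{\mathcal{C}}\right\| \leq \varepsilon_0 \ell_{\varepsilon_0} \left\|(z_t)_{\mathcal{C}^\perp}\right\| \leq \varepsilon_0^2 \ell_{\varepsilon_0} (t+1)^{-\tau}. \tag{44}$$

Hence, from (38) and (43)-(44), we conclude that for $t \in [t_\delta, \rho_\delta)$ we have

$$\mathcal{H}_t(z_t) \geq b_4 V_t - 2 \varepsilon_0^2 \ell_{\varepsilon_0} (t+1)^{-\tau}. \tag{45}$$

Let $\{V_t^\delta\}$ be the $\mathbb{R}_+$-valued $\{\mathcal{F}_t\}$-adapted process such that $V_t^\delta = V_t \mathbb{I}(t < \rho_\delta)$ for all $t$. Noting that

$$V_{t+1}^\delta = V_{t+1} \mathbb{I}(t + 1 < \rho_\delta) \leq V_{t+1} \mathbb{I}(t < \rho_\delta),$$

we have

$$\mathbb{E}_{\theta^*}\left[V_{t+1}^\delta \mid \mathcal{F}_t\right] \leq \mathbb{I}(t < \rho_\delta) \, \mathbb{E}_{\theta^*}\left[V_{t+1} \mid \mathcal{F}_t\right], \quad \forall t. \tag{46}$$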

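For $t \geq t_\delta$ we have by (45)

$$\mathcal{H}_t(z_t) \mathbb{I}(t < \rho_\delta) \geq \left(b_4 V_t - 2 \varepsilon_0^2 \ell_{\varepsilon_0} (t+1)^{-\tau}\right) \mathbb{I}(t < \rho_\delta),$$

hence, it follows from (37) and (46) that

$$\begin{aligned}
\mathbb{E}_{\theta^*}\left[V_{t+1}^\delta \mid \mathcal{F}_t\right] \leq {} & \left(1 + b_1 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right)\right) V_t \mathbb{I}(t < \rho_\delta) \\
& - 2\alpha_t \left(b_4 V_t - 2 \varepsilon_0^2 \ell_{\varepsilon_0} (t+1)^{-\tau}\right) \mathbb{I}(t < \rho_\delta) \\
& - b_2 \left(\beta_t - \beta_t^2\right) \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2 \mathbb{I}(t < \rho_\delta) + b_3 \left(\alpha_t (t+1)^{-\tau_3} + \alpha_t^2 + \alpha_t \beta_t\right) \\
\leq {} & \left(1 - \alpha_t \left(2 b_4 - b_1 (t+1)^{-\tau_3} - b_1 \alpha_t - b_1 \beta_t\right)\right) V_t^\delta - b_2 \left(\beta_t - \beta_t^2\right) \left\|(z_t)_{\mathcal{C}^\perp}\right\|^2 \mathbb{I}(t < \rho_\delta) \\
& + \alpha_t \left(b_3 (t+1)^{-\tau_3} + b_3 \alpha_t + b_3 \beta_t + 4 \varepsilon_0^2 \ell_{\varepsilon_0} (t+1)^{-\tau}\right)
\end{aligned}$$

for all $t \geq t_\delta$. Observing the decay rates of the various coefficients (see (11)), we conclude that there exist a
deterministic time $t'_\delta \geq t_\delta$, and positive constants (independent of $\delta$) $b_5$, $b_6$ and $\tau_4$ such that

$$\mathbb{E}_{\theta^*}\left[V_{t+1}^\delta \mid \mathcal{F}_t\right] \leq (1 - b_5 \alpha_t) V_t^\delta + b_6 \alpha_t (t+1)^{-\tau_4} \tag{47}$$

and $b_5 \alpha_t < 1$, for all $t \geq t'_\delta$.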

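Let us now choose a constant $\overline{\mu}$ (independently of $\delta$) such that $\overline{\mu} \in (0, b_5 \wedge \tau_4 \wedge 1)$. Then, using the inequality

$$(1 + t^{-1})^{\overline{\mu}} \leq 1 + \overline{\mu} \, t^{-1},$$

we have for all $t \geq t'_\delta$

$$(t+1)^{\overline{\mu}} (1 - b_5 \alpha_{t-1}) \leq t^{\overline{\mu}} \left(1 + \overline{\mu} \, t^{-1}\right) \left(1 - b_5 t^{-1}\right) \leq t^{\overline{\mu}} \left(1 - (b_5 - \overline{\mu}) t^{-1}\right)$$

and

$$(t+1)^{\overline{\mu}} t^{-1-\tau_4} = \left(1 + t^{-1}\right)^{\overline{\mu}} t^{-1-\tau_4+\overline{\mu}} \leq \left(1 + (t'_\delta)^{-1}\right)^{\overline{\mu}} t^{-1-\tau_4+\overline{\mu}}.$$

Thus, from (47) we obtain

$$\mathbb{E}_{\theta^*}\left[(t+1)^{\overline{\mu}} V_t^\delta \mid \mathcal{F}_{t-1}\right] \leq \left(1 - (b_5 - \overline{\mu}) t^{-1}\right) t^{\overline{\mu}} V_{t-1}^\delta + b_6 \left(1 + (t'_\delta)^{-1}\right)^{\overline{\mu}} t^{-1-\tau_4+\overline{\mu}}$$

for all $t \geq t'_\delta$. Since $\tau_4 > \overline{\mu}$, we have $\sum_{t} t^{-1-\tau_4+\overline{\mu}} < \infty$; denoting by $\{W_t^\delta\}$ the
non-negative $\{\mathcal{F}_t\}$-adapted process such that

$$W_t^\delta = (t+1)^{\overline{\mu}} V_t^\delta + b_6 \left(1 + (t'_\delta)^{-1}\right)^{\overline{\mu}} \sum_{s=t+1}^{\infty} s^{-1-\tau_4+\overline{\mu}}, \tag{48}$$

we have that $\mathbb{E}_{\theta^*}[W_t^\delta \mid \mathcal{F}_{t-1}] \leq W_{t-1}^\delta$ for all $t \geq t'_\delta$. Hence, the process $\{W_t^\delta\}_{t \geq t'_\delta}$ is a non-negative supermartingale
and converges a.s. to a finite non-negative random variable $W_\delta^*$. By (48) we further conclude that $(t+1)^{\overline{\mu}} V_t^\delta \to W_\delta^*$
a.s. as $t \to \infty$. Now let $\mu \in (0, \overline{\mu})$ be another constant (chosen independently of $\delta$); noting that the limit $W_\delta^*$ is
finite, the above convergence leads to

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\mu} V_t^\delta = 0\right) = 1. \tag{49}$$

By (49) and the fact that $V_t^\delta = V_t \mathbb{I}(t < \rho_\delta)$ for all $t$, we conclude that

$$\lim_{t \to \infty} (t+1)^{\mu} V_t = 0 \quad \text{a.s. on } \{\rho_\delta = \infty\}.$$

Hence, by (42) we obtain

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\mu} V_t = 0\right) \geq 1 - \delta.$$

Since $\delta > 0$ was chosen arbitrarily and $\mu > 0$ is independent of $\delta$, we have, in fact, $(t+1)^{\mu} V_t \to 0$ a.s. as $t \to \infty$
by taking $\delta$ to zero. The desired assertion follows immediately by noting the correspondence between the processes
$\{z_t\}$ and $\{V_t\}$ (see (22)). ■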
The assertions of Theorem 4.2 may readily be extended to the case of non-uniform (over sample paths)
convergence of the matrix gain sequences $\{K_n(t)\}$ to their designated limit $\mathcal{K}$ as follows (see Appendix A for a
proof):

Corollary 4.2. Let the sequences $\{z_n(t)\}$ be defined as in (14). Let Assumptions 2.2, 3.1, and 3.2 hold as in the
hypotheses of Theorem 4.2, and let the matrix gain sequences $\{K_n(t)\}$ be such that $(t+1)^{\tau_3} \|K_n(t) - \mathcal{K}\| \to 0$ a.s.
as $t \to \infty$ for all $n$. Then, the assertions of Theorem 4.2 continue to hold, i.e., there exists $\mu > 0$ such that
$(t+1)^{\mu} \|z_n(t) - \theta^*\| \to 0$ a.s. as $t \to \infty$ for all $n$.

Note that Corollary 4.2 is in fact a restatement of Theorem 4.2 under the relaxed assumption that the convergence
of the matrix gain sequences $\{K_n(t)\}$ need not be uniform over sample paths.

5. Proofs of Main Results

Throughout this section, Assumption 2.2 and Assumptions 3.1-3.3 are assumed to hold.

A. Convergence of Auxiliary Estimates and Adaptive Gains

The first result concerns the consistency of the auxiliary estimate sequence $\{\check{x}_n(t)\}$ at each agent. To this end,
noting that the evolution of the auxiliary estimates, see (7), corresponds to a specific instantiation of the generic
estimator dynamics analyzed in Theorem 4.2 (with $K_n(t) = I_M$ for all $n$ and $t$), we immediately have the following:

Lemma 5.1. For each $n$, the auxiliary estimate sequence $\{\check{x}_n(t)\}$ (see Section 3-A) is a strongly consistent estimate
of $\theta^*$. In particular, there exists a positive constant $\mu_0$ such that $(t+1)^{\mu_0} \|\check{x}_n(t) - \theta^*\| \to 0$ as $t \to \infty$ a.s. for all
$n$.

Lemma 5.1 and local Lipschitz continuity of the functions $h_n(\cdot)$ lead to the following characterization of the
adaptive gain sequences $\{K_n(t)\}$ driving the local innovation terms of the agent estimates $\{x_n(t)\}$, see (8) (see
Appendix B for a proof):

Lemma 5.2. There exists a positive constant $\tau'$ such that, for each $n$, the adaptive gain sequence $\{K_n(t)\}$, see (9),
converges a.s. to $N \cdot I^{-1}(\theta^*)$ at rate $\tau'$, i.e.,

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\tau'} \left\|K_n(t) - N \cdot I^{-1}(\theta^*)\right\| = 0\right) = 1,$$

where $I(\cdot)$ denotes the centralized Fisher information, see (6).

As an immediate consequence of the above development, we have the following consistency of the distributed
agent estimates $\{x_n(t)\}$:

Corollary 5.1. For each $n$, the estimate sequence $\{x_n(t)\}$ (see Section 3-A) is a strongly consistent estimate of
$\theta^*$, i.e., $x_n(t) \to \theta^*$ as $t \to \infty$ a.s.

Proof: Note that, by Lemma 5.2, there exists $\tau' > 0$ such that $(t+1)^{\tau'} \|K_n(t) - N \cdot I^{-1}(\theta^*)\| \to 0$ as $t \to \infty$
a.s. Thus, the sequences $\{x_n(t)\}$ fall under the purview of Theorem 4.2 (with $\mathcal{K} = N \cdot I^{-1}(\theta^*)$) and the assertion
follows. ■

B. Proofs of Theorem 3.1 and Theorem 3.2

The key idea in proving Theorem 3.1 and Theorem 3.2 consists of comparing the nonlinear estimate recursions,
see (8), to a suitably linearized recursion. To this end, we consider the following result on distributed linear
stochastic recursions developed in [21] in the context of asymptotically efficient distributed parameter estimation
in linear multi-agent models. The result to be stated below is somewhat less general than the development in [21],
but serves the current scenario.

Theorem 5.1 (Theorem 3.2 and Theorem 3.3 in [21]). For each $n$, let $\{v_n(t)\}$ be an $\mathbb{R}^M$-valued $\{\mathcal{F}_t\}$-adapted
process evolving in a distributed fashion as follows:

$$v_n(t+1) = v_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left(v_n(t) - v_l(t)\right) + \alpha_t D_n(t) \left(B_n \left(\theta^* - v_n(t)\right) + \zeta_n(t)\right),$$

where $B_n$, for each $n$, is an $M_n \times M$ matrix (for some positive integer $M_n$) such that

(1) the matrix $A = \sum_{n=1}^{N} B_n^{\top} B_n$ is positive definite;

(2) for each $n$, the $M \times M_n$ matrix-valued process $\{D_n(t)\}$ is $\{\mathcal{F}_t\}$-adapted with $D_n(t) \to N \cdot A^{-1} B_n^{\top}$ as $t \to \infty$
a.s.;

(3) for each $n$, the $\{\mathcal{F}_{t+1}\}$-adapted sequence $\{\zeta_n(t)\}$ is such that $\zeta_n(t)$ is independent of $\mathcal{F}_t$ for all $t$, the
sequence $\{\zeta_n(t)\}$ is i.i.d. with zero mean and covariance $I_{M_n}$, and there exists a constant $\varepsilon > 0$ such that
$\mathbb{E}\left[\|\zeta_n(t)\|^{2+\varepsilon}\right] < \infty$;

(4) the Laplacian sequence $\{L_t\}$ representing the random communication neighborhoods $\Omega_n(t)$, $n = 1, \cdots, N$,
satisfies Assumption 3.1; and

(5) the weight sequences $\{\alpha_t\}$ and $\{\beta_t\}$ satisfy Assumption 3.2.

Then the following hold for the processes $\{v_n(t)\}$, $n = 1, \cdots, N$:

(1) for each $n$ and $\tau \in [0, 1/2)$, we have

$$\mathbb{P}\left(\lim_{t \to \infty} (t+1)^{\tau} \|v_n(t) - \theta^*\| = 0\right) = 1;$$

(2) for each $n$, the sequence $\{v_n(t)\}$, viewed as an estimate of $\theta^*$, is asymptotically normal with asymptotic
covariance $A^{-1}$, i.e.,

$$\sqrt{t+1}\,\left(v_n(t) - \theta^*\right) \Longrightarrow \mathcal{N}\left(0, A^{-1}\right).$$

The following corollary to Theorem 5.1 will be used in the sequel.

Corollary 5.2. For each $n$, let $\{v_n(t)\}$ be the $\{\mathcal{F}_t\}$-adapted $\mathbb{R}^M$-valued process evolving in a distributed fashion
as

$$v_n(t+1) = v_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left(v_n(t) - v_l(t)\right) + \alpha_t K_n(t) \left(I_n(\theta^*) \left(\theta^* - v_n(t)\right) + w_n(t)\right), \tag{50}$$

where

(1) for each $n$, $\{w_n(t)\}$ is the $\{\mathcal{F}_{t+1}\}$-adapted sequence given by $w_n(t) = g_n(y_n(t)) - h_n(\theta^*)$ for all $t$;

(2) for each $n$, $\{K_n(t)\}$ denotes the $\{\mathcal{F}_t\}$-adapted innovation gain sequence defined as in (9);

(3) the Laplacian sequence $\{L_t\}$ representing the random communication neighborhoods $\Omega_n(t)$, $n = 1, \cdots, N$,
satisfies Assumption 3.1; and

(4) the weight sequences $\{\alpha_t\}$ and $\{\beta_t\}$ satisfy Assumption 3.2.

Then the following hold for the processes $\{v_n(t)\}$, $n = 1, \cdots, N$:

(1) for each $n$ and $\tau \in [0, 1/2)$, we have

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\tau} \|v_n(t) - \theta^*\| = 0\right) = 1;$$

(2) for each $n$, the sequence $\{v_n(t)\}$, viewed as an estimate of $\theta^*$, is asymptotically efficient,
i.e.,

$$\sqrt{t+1}\,\left(v_n(t) - \theta^*\right) \Longrightarrow \mathcal{N}\left(0, I^{-1}(\theta^*)\right).$$

Proof: Note that, by Proposition 2.2, for each $n$ the local Fisher information matrix $I_n(\theta^*)$ is positive
semidefinite; hence, there exists (for example, by a Cholesky factorization) a positive integer $M_n$ and an $M_n \times M$
matrix $B_n$ such that $I_n(\theta^*) = B_n^{\top} B_n$. By Proposition 2.1, for each $n$, the sequence $\{w_n(t)\}$ possesses moments
of all orders and is zero-mean with covariance $I_n(\theta^*)$. Since $I_n(\theta^*) = B_n^{\top} B_n$, there exists another $\{\mathcal{F}_{t+1}\}$-adapted
sequence $\{\zeta_n(t)\}$ (not necessarily unique, depending on the rank of the matrix $B_n$) satisfying condition (3) in the
hypothesis of Theorem 5.1 such that $B_n^{\top} \zeta_n(t) = w_n(t)$ for all $t$ a.s.

Also, for each $n$, denote by $\{D_n(t)\}$ the $M \times M_n$ matrix-valued $\{\mathcal{F}_t\}$-adapted process such that $D_n(t) =
K_n(t) B_n^{\top}$ for all $t$; since, by Lemma 5.2, $K_n(t) \to N \cdot I^{-1}(\theta^*)$ as $t \to \infty$ a.s., we have that $D_n(t) \to N \cdot I^{-1}(\theta^*) B_n^{\top}$
as $t \to \infty$ a.s.

It is now clear that the evolution of the sequences $\{v_n(t)\}$ may be rewritten as follows in terms of the newly
introduced variables:

$$v_n(t+1) = v_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left(v_n(t) - v_l(t)\right) + \alpha_t D_n(t) \left(B_n \left(\theta^* - v_n(t)\right) + \zeta_n(t)\right). \tag{51}$$

Finally, noting that, by construction and Proposition 2.2,

$$I(\theta^*) = \sum_{n=1}^{N} I_n(\theta^*) = \sum_{n=1}^{N} B_n^{\top} B_n,$$

we conclude that the evolution in (51) falls under the purview of Theorem 5.1 (with the identification that $A = I(\theta^*)$)
and the desired assertions follow. ■
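
The factorization step in the proof can be made concrete with a small numerical sketch. The routine below is one possible way (not prescribed by the paper) to obtain a factor $B_n$ with $B_n^{\top} B_n = I_n(\theta^*)$ from a symmetric positive semidefinite matrix, using a pivoted outer-product Cholesky so that rank-deficient matrices are handled and the number of rows of $B_n$ equals the rank. The function names and the rank-one example matrix are hypothetical.

```python
def psd_factor(A, tol=1e-12):
    """Return B (list of rows) with B^T B = A for a symmetric PSD matrix A.

    Pivoted outer-product Cholesky; the number of rows equals rank(A).
    """
    m = len(A)
    R = [row[:] for row in A]          # running Schur complement
    B = []
    for _ in range(m):
        p = max(range(m), key=lambda i: R[i][i])   # largest remaining diagonal entry
        if R[p][p] <= tol:             # nothing left to factor
            break
        pivot = R[p][p] ** 0.5
        b = [R[p][j] / pivot for j in range(m)]
        B.append(b)
        for i in range(m):
            for j in range(m):
                R[i][j] -= b[i] * b[j]  # remove the rank-one piece b b^T
    return B


def gram(B, m):
    """Compute B^T B for a row-list B with m columns."""
    return [[sum(row[i] * row[j] for row in B) for j in range(m)] for i in range(m)]


# Hypothetical rank-one local Fisher information with M = 2.
I_n = [[1.0, 2.0], [2.0, 4.0]]
B_n = psd_factor(I_n)
```

For this example the factor has a single row, so $M_n = 1 < M = 2$; this is the rank-deficient situation in which the sequence $\{\zeta_n(t)\}$ is not uniquely determined by $\{w_n(t)\}$.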
The processes $\{v_n(t)\}$ as introduced and analyzed in Corollary 5.2 may, in fact, be viewed as linearizations of
the nonlinear estimator dynamics, see (8), the linearizations being performed in the vicinity of the true parameter
value $\theta^*$. Clearly, in order for such linearization to provide meaningful insight into the actual nonlinear dynamics
of the estimators $\{x_n(t)\}$, it is necessary that the latter approach and stay close to $\theta^*$ (around which the linearization
is carried out) asymptotically, which, in turn, is guaranteed by the consistency of the estimators $\{x_n(t)\}$, see
Corollary 5.1. The consistency allows us to obtain insight into the detailed dynamics of the estimators $\{x_n(t)\}$ by
characterizing the pathwise deviations of the former from their linearized counterparts. These ideas are formalized
below in Lemma 5.3 (see Appendix B for a proof), leading to the main results of the paper as presented in
Section 3-B.

Lemma 5.3. For each $n$, let $\{x_n(t)\}$ be the estimate sequence at agent $n$ as defined in (8), and $\{v_n(t)\}$ denote
the process defined in (50) under the hypotheses of Corollary 5.2. Then, there exists a constant $\overline{\tau} > 1/2$ such that

$$\mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} (t+1)^{\overline{\tau}} \|x_n(t) - v_n(t)\| = 0\right) = 1$$

for all $n$.

With the above development, we may now complete the proofs of Theorem 3.1 and Theorem 3.2 as follows.

Proof of Theorem 3.1: Let $\tau \in [0, 1/2)$ and note that, for each $n$,

$$(t+1)^{\tau} \|x_n(t) - \theta^*\| \leq (t+1)^{\tau} \|x_n(t) - v_n(t)\| + (t+1)^{\tau} \|v_n(t) - \theta^*\|, \tag{52}$$

where $\{v_n(t)\}$ is the (linearized) approximation introduced and analyzed in Corollary 5.2. By Lemma 5.3 (first
assertion) and Corollary 5.2, since $\tau < 1/2$, we have $(t+1)^{\tau} \|x_n(t) - v_n(t)\| \to 0$ and $(t+1)^{\tau} \|v_n(t) - \theta^*\| \to 0$,
respectively, as $t \to \infty$ a.s. Hence, by (52) we obtain $(t+1)^{\tau} \|x_n(t) - \theta^*\| \to 0$ as $t \to \infty$ a.s., thus establishing
Theorem 3.1. ■

Proof of Theorem 3.2: Note that by Lemma 5.3 (first assertion), for each $n$

$$\begin{aligned}
& \mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} \left\|\sqrt{t+1}\,(x_n(t) - \theta^*) - \sqrt{t+1}\,(v_n(t) - \theta^*)\right\| = 0\right) \\
& \qquad = \mathbb{P}_{\theta^*}\left(\lim_{t \to \infty} \sqrt{t+1}\, \|x_n(t) - v_n(t)\| = 0\right) = 1.
\end{aligned}$$

Hence, in particular, the sequences $\{\sqrt{t+1}\,(x_n(t) - \theta^*)\}$ and $\{\sqrt{t+1}\,(v_n(t) - \theta^*)\}$ possess the same weak limit
(if the latter exists for one of the sequences); the asymptotic normality (efficiency) in Theorem 3.2 then follows
immediately from the corresponding property of the $\{v_n(t)\}$ sequence in Corollary 5.2 (second assertion). ■

6. Conclusions 

We have developed distributed estimators of the consensus + innovations type for multi-agent scenarios with 
general exponential family observation statistics that yield consistent and asymptotically efficient parameter estimates 
at all agents. Moreover, the above estimator properties and optimality hold as long as the aggregate or global sensing 
model is observable and the inter-agent communication network is connected in the mean (otherwise, irrespective of 
the network sparsity). Along the way, we have characterized analogues of classical system and information theoretic 
notions such as observability to the distributed-information setting. 

An important future research question arises naturally: in this paper we have assumed that the parametrization
is continuous and unconstrained, i.e., the parameter may take values over the entire space $\mathbb{R}^M$. It would be of interest to extend
the approach to account for constrained parametrization: the parameter $\theta$ could belong to a restricted subset
$\Theta \subset \mathbb{R}^M$, either because of direct physical constraints or due to constrained natural parameterizations of the local
exponential families involved, i.e., the domains of definition of the functions $\lambda_n(\cdot)$ in (1) being strict subsets of
$\mathbb{R}^M$. A specific instance is finite classification or detection (hypothesis testing) problems in which $\theta$ may only
assume a finite set of values. The unconstrained estimation approach (7)-(10) may still be applicable to a subclass
of such constrained cases by considering suitable analytical extensions of the various functions $\lambda_n(\cdot)$'s, $h_n(\cdot)$'s etc.
over $\mathbb{R}^M$; provided such extensions exist, the proposed algorithm will lead to asymptotically efficient estimates
at the network agents although the intermediate iterates may not belong to $\Theta$. As a familiar example where such
extension may be achievable by embedding, we may envision a binary hypothesis testing problem corresponding to
the presence or absence of a signal observed in additive zero-mean Gaussian noise with known variance. In cases
where such analytical extensions may not be obtained, other modifications of the proposed scheme, for example
supplementing the local estimate update processes with a projection step onto the set $\Theta$, may be helpful. In the
interest of obtaining a unified distributed inference framework, it would be worthwhile to study such extensions
and modifications of the proposed scheme.



Appendix A 
Proofs of Results in Section 4

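Proof of Proposition 4.1: Let $z \in \Gamma_\varepsilon$ and note that by reasoning along the lines of (20) we obtain

$$\left(z - \boldsymbol{\theta}^*\right)^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) \left(z - \boldsymbol{\theta}^*\right) = \left(z_{\mathcal{C}^\perp}\right)^{\top} (\overline{L} \otimes \mathcal{K}^{-1}) z_{\mathcal{C}^\perp} \geq \lambda_2(\overline{L}) \lambda_1(\mathcal{K}^{-1}) \left\|z_{\mathcal{C}^\perp}\right\|^2, \tag{53}$$

where $z_{\mathcal{C}^\perp}$ denotes the projection of $z$ onto the orthogonal complement of the consensus subspace (see
Definition 1.1). We thus obtain

$$\begin{aligned}
\mathcal{H}_t(z) \geq {} & \frac{\beta_t}{\alpha_t} b_\beta \lambda_2(\overline{L}) \lambda_1(\mathcal{K}^{-1}) \left\|z_{\mathcal{C}^\perp}\right\|^2 + \left(z - \boldsymbol{\theta}^*\right)^{\top} \left(h(z) - h(\boldsymbol{\theta}^*)\right) \\
\geq {} & \frac{\beta_t}{\alpha_t} b_\beta \lambda_2(\overline{L}) \lambda_1(\mathcal{K}^{-1}) \left\|z_{\mathcal{C}^\perp}\right\|^2 + \left(z_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h(z_{\mathcal{C}}) - h(\boldsymbol{\theta}^*)\right) \\
& + \left(z_{\mathcal{C}^\perp}\right)^{\top} \left(h(z) - h(\boldsymbol{\theta}^*)\right) + \left(z_{\mathcal{C}} - \boldsymbol{\theta}^*\right)^{\top} \left(h(z) - h(z_{\mathcal{C}})\right).
\end{aligned} \tag{54}$$

In order to bound the last two terms in the above inequality, note that, for $z \in \Gamma_\varepsilon$, by invoking Assumption 3.3 we
obtain

$$\left|\left(z_{\mathcal{C}^\perp}\right)^{\top} \left(h(z) - h(\boldsymbol{\theta}^*)\right)\right| \leq c_1 \left\|z_{\mathcal{C}^\perp}\right\| \left(1 + \|z - \boldsymbol{\theta}^*\|\right) \leq c_1 \left(1/\varepsilon + 1\right) \left\|z_{\mathcal{C}^\perp}\right\|, \tag{55}$$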



where $c_1$ is a positive constant. Also, by Proposition 2.1, the functions $h_n(\cdot)$ are infinitely continuously differentiable and hence locally Lipschitz; in particular, noting that the set

\[
\Gamma'_\varepsilon = \left\{ z \in \mathbb{R}^{NM} : \|z - \theta^*\| \le 1/\varepsilon \right\}
\]

is compact, there exists a constant $\ell_{\Gamma'_\varepsilon} > 0$ such that

\[
\|h(z) - h(z')\| \le \ell_{\Gamma'_\varepsilon} \|z - z'\|, \quad \forall z, z' \in \Gamma'_\varepsilon.
\]

Footnote: We are somewhat abusing the notion of estimation which, to be precise, corresponds to inferring continuous parameters as pursued in this paper. However, by considering constrained parametrization, we are essentially expanding its usage to general inference problems including detection and classification.

Footnote: An idea related to such analytical extensions is that of embedding into an exponential family (see [39] for some discussion in a related but centralized context), in which, broadly speaking, the objective is to obtain an unconstrained exponential family whose restriction to $\Theta$ coincides with the given constrained family.

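By observing that, for $z \in \Gamma_\varepsilon$,

\[
\|z_{\mathcal{C}} - \theta^*\| \le 1/\varepsilon, \tag{56}
\]

we obtain $z_{\mathcal{C}} \in \Gamma'_\varepsilon$; hence, by (56), we may conclude that

\[
\|h(z) - h(z_{\mathcal{C}})\| \le \ell_{\Gamma'_\varepsilon} \|z - z_{\mathcal{C}}\| = \ell_{\Gamma'_\varepsilon} \|z_{\mathcal{C}^\perp}\|
\]

for all $z \in \Gamma_\varepsilon$. Thus, for $z \in \Gamma_\varepsilon$, we have

\[
\left| (z_{\mathcal{C}} - \theta^*)^\top \left(h(z) - h(z_{\mathcal{C}})\right) \right| \le \ell_{\Gamma'_\varepsilon} \|z_{\mathcal{C}^\perp}\| \, \|z_{\mathcal{C}} - \theta^*\| \le \left(\ell_{\Gamma'_\varepsilon}/\varepsilon\right) \|z_{\mathcal{C}^\perp}\|. \tag{57}
\]

Combining (53)-(55) and (57) we then obtain

\[
\mathcal{H}_t(z) \ge \left( \frac{\beta_t}{\alpha_t}\,\lambda_2(\overline{L})\,\lambda_1(\mathcal{K}^{-1})\,\|z_{\mathcal{C}^\perp}\| - c_1\left(\frac{1}{\varepsilon} + 1\right) - \frac{\ell_{\Gamma'_\varepsilon}}{\varepsilon} \right) \|z_{\mathcal{C}^\perp}\| + (z_{\mathcal{C}} - \theta^*)^\top \left(h(z_{\mathcal{C}}) - h(\theta^*)\right) \tag{58}
\]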
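for all $z \in \Gamma_\varepsilon$. By invoking standard properties of quadratic minimization, we note that there exist positive constants $c_\varepsilon$, $c_3(\varepsilon)$ and $c_4(\varepsilon)$ such that, for all $t \ge 0$,

\[
\left( \frac{\beta_t}{\alpha_t}\,\lambda_2(\overline{L})\,\lambda_1(\mathcal{K}^{-1})\,\|z_{\mathcal{C}^\perp}\| - c_1\left(\frac{1}{\varepsilon} + 1\right) - \frac{\ell_{\Gamma'_\varepsilon}}{\varepsilon} \right) \|z_{\mathcal{C}^\perp}\| \ge c_\varepsilon \tag{59}
\]

for all $z$ with $\|z_{\mathcal{C}^\perp}\| \ge c_3(\varepsilon)\,(\alpha_t/\beta_t)$, and

\[
\left( \frac{\beta_t}{\alpha_t}\,\lambda_2(\overline{L})\,\lambda_1(\mathcal{K}^{-1})\,\|z_{\mathcal{C}^\perp}\| - c_1\left(\frac{1}{\varepsilon} + 1\right) - \frac{\ell_{\Gamma'_\varepsilon}}{\varepsilon} \right) \|z_{\mathcal{C}^\perp}\| \ge -c_4(\varepsilon)\,(\alpha_t/\beta_t) \tag{60}
\]

for all $z$, in particular, $z \in \Gamma_\varepsilon$.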
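The quadratic-minimization facts behind bounds of this type can be checked numerically. The sketch below is illustrative only: it writes the bracketed expression times $\|z_{\mathcal{C}^\perp}\|$ as $q(x) = (a x - b)\,x$ with $x = \|z_{\mathcal{C}^\perp}\|$, where $a > 0$ plays the role of $(\beta_t/\alpha_t)\lambda_2(\overline{L})\lambda_1(\mathcal{K}^{-1})$ and $b \ge 0$ collects the remaining constants. The names `q` and `check_bounds` are hypothetical. It verifies that $q(x) \ge -b^2/(4a)$ for all $x \ge 0$, a lower bound proportional to $1/a$ and hence to $\alpha_t/\beta_t$, and that $q(x) \ge 0$ whenever $x \ge b/a$.

```python
def q(x: float, a: float, b: float) -> float:
    """Quadratic (a*x - b)*x in the scalar x >= 0."""
    return (a * x - b) * x


def check_bounds(a: float, b: float, n_grid: int = 2000, x_max: float = 10.0) -> bool:
    """Check q(x) >= -b^2/(4a) on a grid, and q(x) >= 0 for grid points x >= b/a."""
    lower = -b * b / (4.0 * a)  # global minimum of q, attained at x = b/(2a)
    for i in range(n_grid + 1):
        x = x_max * i / n_grid
        val = q(x, a, b)
        if val < lower - 1e-9:
            return False
        if x >= b / a and val < -1e-9:
            return False
    return True
```

Increasing `a` (that is, increasing $\beta_t/\alpha_t$) shrinks both the threshold $b/a$ and the worst-case value $-b^2/(4a)$, which is the mechanism exploited as $\alpha_t/\beta_t \to 0$.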

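Now, note that, by Proposition 2.1 (third assertion), for all $z \in \mathbb{R}^{NM}$ with $z_{\mathcal{C}} \ne \theta^*$ we have

\[
(z_{\mathcal{C}} - \theta^*)^\top \left(h(z_{\mathcal{C}}) - h(\theta^*)\right) = \sum_{n=1}^{N} (z^a_{\mathcal{C}} - \theta^*)^\top \left(h_n(z^a_{\mathcal{C}}) - h_n(\theta^*)\right) > 0
\]

(see also Definition 1.1), where $z^a_{\mathcal{C}} \in \mathbb{R}^M$ denotes the common agent-level component of $z_{\mathcal{C}}$, i.e., $z_{\mathcal{C}} = 1_N \otimes z^a_{\mathcal{C}}$. Let us choose $\varepsilon'$ such that $\varepsilon' \in (0, \varepsilon)$; noting that the functions $h_n(\cdot)$ are continuous and the set

\[
\Gamma_{\varepsilon,\varepsilon'} = \left\{ z \in \Gamma_\varepsilon : \|z_{\mathcal{C}} - \theta^*\| \ge \varepsilon' \right\}
\]

is compact, we conclude that there exists $\delta_\varepsilon > 0$ such that

\[
\inf_{z \in \Gamma_{\varepsilon,\varepsilon'}} (z_{\mathcal{C}} - \theta^*)^\top \left(h(z_{\mathcal{C}}) - h(\theta^*)\right) \ge \delta_\varepsilon. \tag{61}
\]

Further, since $\alpha_t/\beta_t \to 0$ as $t \to \infty$ (by hypothesis), there exist $t_\varepsilon$ large enough and a constant $\bar{\delta}_\varepsilon > 0$ such that

\[
\varepsilon' \le \varepsilon - c_3(\varepsilon)\,(\alpha_t/\beta_t) \quad \text{and} \quad c_4(\varepsilon)\,(\alpha_t/\beta_t) \le \delta_\varepsilon - \bar{\delta}_\varepsilon \tag{62}
\]

for all $t \ge t_\varepsilon$.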

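We now show that there exists $\delta'_\varepsilon > 0$ (independent of $t$) such that

\[
\inf_{z \in \Gamma_\varepsilon} \mathcal{H}_t(z) \ge \delta'_\varepsilon \tag{63}
\]

for all $t \ge t_\varepsilon$. To this end, for $t \ge t_\varepsilon$, let $z \in \Gamma_\varepsilon$ and consider the two cases as to whether $\|z_{\mathcal{C}^\perp}\| \ge c_3(\varepsilon)\,(\alpha_t/\beta_t)$ or not. Noting that, by Proposition 2.1,

\[
(z_{\mathcal{C}} - \theta^*)^\top \left(h(z_{\mathcal{C}}) - h(\theta^*)\right) \ge 0, \quad \forall z \in \mathbb{R}^{NM},
\]

we have by (58)-(59) that

\[
\mathcal{H}_t(z) \ge c_\varepsilon \tag{64}
\]

for all $z \in \Gamma_\varepsilon$ with $\|z_{\mathcal{C}^\perp}\| \ge c_3(\varepsilon)\,(\alpha_t/\beta_t)$. Now consider the other case, i.e., let $z \in \Gamma_\varepsilon$ with $\|z_{\mathcal{C}^\perp}\| < c_3(\varepsilon)\,(\alpha_t/\beta_t)$; note that, since $t \ge t_\varepsilon$, we have for such $z$ by (62)

\[
\|z_{\mathcal{C}} - \theta^*\| \ge \|z - \theta^*\| - \|z_{\mathcal{C}^\perp}\| \ge \varepsilon - c_3(\varepsilon)\,(\alpha_t/\beta_t) \ge \varepsilon'.
\]

Hence, necessarily $z \in \Gamma_{\varepsilon,\varepsilon'}$ and we have by (58), (60), and (61)-(62)

\[
\mathcal{H}_t(z) \ge \delta_\varepsilon - c_4(\varepsilon)\,(\alpha_t/\beta_t) \ge \bar{\delta}_\varepsilon. \tag{65}
\]

From (64) and (65) we then obtain, for all $z \in \Gamma_\varepsilon$ and $t \ge t_\varepsilon$,

\[
\mathcal{H}_t(z) \ge \delta'_\varepsilon > 0,
\]

where $\delta'_\varepsilon = c_\varepsilon \wedge \bar{\delta}_\varepsilon$, thus establishing the assertion in (63).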

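Finally, the desired claim follows, with the constant $\varepsilon^2 \delta'_\varepsilon$, by noting that

\[
\mathcal{H}_t(z) \ge \delta'_\varepsilon \ge \delta'_\varepsilon \left( \varepsilon^2 \|z - \theta^*\|^2 \right) = \left(\varepsilon^2 \delta'_\varepsilon\right) \|z - \theta^*\|^2
\]

for all $z \in \Gamma_\varepsilon$ and $t \ge t_\varepsilon$, where we used the fact that $\varepsilon \|z - \theta^*\| \le 1$ on $\Gamma_\varepsilon$.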

Proof of Lemma 4.2: Before proceeding to the proof of Lemma 4.2, we state the following approximation results obtained in [21] on convergence estimates of stochastic recursions (Lemma A.1) and certain attributes of time-varying stochastic Laplacian matrices (Lemma A.2), to be used as intermediate ingredients in the proof.



Lemma A.1 (Lemma 4.3 in [21]). Let $\{w_t\}$ be an $\mathbb{R}_+$-valued $\{\mathcal{F}_t\}$-adapted process that satisfies

\[
w_{t+1} \le \left(1 - r_1(t)\right) w_t + r_2(t)\, U_t \left(1 + J_t\right).
\]

In the above, $\{r_1(t)\}$ is an $\{\mathcal{F}_{t+1}\}$-adapted process, such that for all $t$, $r_1(t)$ satisfies $0 \le r_1(t) \le 1$ and

\[
\frac{a_1}{(t+1)^{\delta_1}} \le \mathbb{E}\left[ r_1(t) \mid \mathcal{F}_t \right] \le 1
\]

with $a_1 > 0$ and $0 \le \delta_1 < 1$; $\{r_2(t)\}$ is a deterministic sequence satisfying $r_2(t) \le a_2/(t+1)^{\delta_2}$ for all $t \ge 0$, where $a_2$ and $\delta_2$ are positive constants. Further, let $\{U_t\}$ and $\{J_t\}$ be $\mathbb{R}_+$-valued $\{\mathcal{F}_t\}$-adapted and $\{\mathcal{F}_{t+1}\}$-adapted processes, respectively, with $\sup_{t \ge 0} U_t < \infty$ a.s. The process $\{J_t\}$ is i.i.d. with $J_t$ independent of $\mathcal{F}_t$ for each $t$ and satisfies the moment condition $\mathbb{E}\left[J_t^{2+\varepsilon}\right] < \infty$ for a constant $\varepsilon > 0$. Then, if $\delta_2 > \delta_1 + 1/(2+\varepsilon)$, we have $(t+1)^{\delta_0} w_t \to 0$ a.s. as $t \to \infty$ for all $\delta_0 \in [0, \delta_2 - \delta_1 - 1/(2+\varepsilon))$.
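The behavior described by Lemma A.1 can be illustrated with a small simulation. The sketch below is only an illustration and not part of [21]: it assumes the recursion holds with equality, takes $r_1(t) = a_1/(t+1)^{\delta_1}$ and $r_2(t) = a_2/(t+1)^{\delta_2}$ as deterministic sequences, fixes $U_t = 1$, and draws $J_t$ i.i.d. from a unit-rate exponential distribution (which has moments of all orders). The function names and default parameter values are hypothetical.

```python
import random


def simulate_lemma_a1(T: int, delta1: float = 0.3, delta2: float = 1.0,
                      a1: float = 1.0, a2: float = 1.0, seed: int = 0) -> list:
    """Simulate w_{t+1} = (1 - r1(t)) w_t + r2(t) U_t (1 + J_t) for T steps."""
    rng = random.Random(seed)
    w = 1.0
    trajectory = [w]
    for t in range(T):
        r1 = a1 / (t + 1) ** delta1   # deterministic, takes values in (0, 1]
        r2 = a2 / (t + 1) ** delta2   # deterministic decaying weight
        jump = rng.expovariate(1.0)   # i.i.d. nonnegative noise J_t
        w = (1.0 - r1) * w + r2 * 1.0 * (1.0 + jump)  # U_t = 1
        trajectory.append(w)
    return trajectory


def scaled_value(trajectory: list, t: int, delta0: float) -> float:
    """Return (t + 1)^delta0 * w_t."""
    return (t + 1) ** delta0 * trajectory[t]
```

With the defaults, $\delta_2 - \delta_1 = 0.7$, so any $\delta_0 < 0.7$ is admissible once the moment exponent is taken large; calling `scaled_value(traj, t, 0.5)` at increasing `t` shows the scaled iterate shrinking slowly.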

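Lemma A.2 (Lemma 4.4 in [21]). Let $\{w_t\}$ be an $\mathbb{R}^{NM}$-valued $\{\mathcal{F}_t\}$-adapted process such that $w_t \in \mathcal{C}^\perp$ for all $t$, where $\mathcal{C}^\perp$ denotes the orthogonal complement of the consensus subspace $\mathcal{C}$, see Definition 1.1. Also, let $\{L_t\}$ be an $\{\mathcal{F}_t\}$-adapted sequence of Laplacians satisfying Assumption 3.7. Then there exists an $\{\mathcal{F}_{t+1}\}$-adapted $\mathbb{R}_+$-valued process $\{r_t\}$ (depending on $\{w_t\}$ and $\{L_t\}$), a deterministic time $t_r$ (large enough), and a constant $c_r > 0$, such that $0 \le r_t \le 1$ a.s. and

\[
\left\| \left(I_{NM} - \beta_t L_t \otimes I_M\right) w_t \right\| \le (1 - r_t)\, \|w_t\|
\]

with

\[
\mathbb{E}\left[ r_t \mid \mathcal{F}_t \right] \ge \frac{c_r}{(t+1)^{\tau_2}} \quad \text{a.s.}
\]

for all $t \ge t_r$, where the weight sequence $\{\beta_t\}$ and $\tau_2$ are defined in (11).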
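For intuition, the contraction in Lemma A.2 can be checked in a deterministic special case. The sketch below is an assumption-laden illustration, not the lemma itself: it fixes the Laplacian to that of an undirected ring, takes $M = 1$, and uses a constant weight $\beta$ small enough that $\beta\lambda_N(L) \le 1$, in which case $r_t$ may be taken as $\beta \lambda_2(L)$. All function names are hypothetical.

```python
import math


def ring_laplacian(n: int) -> list:
    """Laplacian of an undirected ring on n >= 3 nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 2.0
        L[i][(i + 1) % n] = -1.0
        L[i][(i - 1) % n] = -1.0
    return L


def ring_lambda2(n: int) -> float:
    """Second-smallest Laplacian eigenvalue of the ring (known closed form)."""
    return 2.0 - 2.0 * math.cos(2.0 * math.pi / n)


def center(w: list) -> list:
    """Project w onto the orthogonal complement of the consensus subspace (M = 1)."""
    mean = sum(w) / len(w)
    return [x - mean for x in w]


def norm(w: list) -> float:
    return math.sqrt(sum(x * x for x in w))


def contraction_ratio(L: list, w: list, beta: float) -> float:
    """Return ||(I - beta L) w|| / ||w|| for a nonzero vector w."""
    n = len(w)
    stepped = [w[i] - beta * sum(L[i][j] * w[j] for j in range(n)) for i in range(n)]
    return norm(stepped) / norm(w)
```

For a ring on five nodes and $\beta = 0.2$, every centered vector contracts by at least the factor $1 - \beta\lambda_2(L)$; the random-Laplacian setting of the lemma replaces this pathwise bound by a bound on the conditional mean of $r_t$.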

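Proof of Lemma 4.2: Recall the $\{\mathcal{F}_t\}$-adapted process $\{z_t\}$ with $z_t = \mathrm{Vec}(z_n(t))$ for all $t$, and note that by (14) we have

\[
z_{t+1} = \left(I_{NM} - \beta_t L_t \otimes I_M\right) z_t - \alpha_t K_t \left(h(z_t) - h(\theta^*)\right) + \alpha_t K_t \left(g(y_t) - h(\theta^*)\right),
\]

the functions $h(\cdot)$ and $g(\cdot)$ being as defined earlier. For each $n$, let $\tilde{z}_n(t) = z_n(t) - z^a_t$, and denote by $\{\tilde{z}_t\}$ the $\{\mathcal{F}_t\}$-adapted process where $\tilde{z}_t = \mathrm{Vec}(\tilde{z}_n(t))$ for all $t$. Using the fact that $(L_t \otimes I_M)(1_N \otimes z^a_t) = 0$, we have

\[
\tilde{z}_{t+1} = \left(I_{NM} - \beta_t L_t \otimes I_M\right) \tilde{z}_t - \alpha_t U'_t + \alpha_t J'_t, \tag{66}
\]

where $\{U'_t\}$ and $\{J'_t\}$ are $\{\mathcal{F}_t\}$-adapted and $\{\mathcal{F}_{t+1}\}$-adapted processes given by

\[
U'_t = \left(I_{NM} - \tfrac{1}{N}\left(1_N 1_N^\top\right) \otimes I_M\right) K_t \left(h(z_t) - h(\theta^*)\right), \tag{67}
\]

and

\[
J'_t = \left(I_{NM} - \tfrac{1}{N}\left(1_N 1_N^\top\right) \otimes I_M\right) K_t \left(g(y_t) - h(\theta^*)\right),
\]

respectively. Note that by hypothesis $\sup_t \|K_t\| < \infty$ a.s., and, by Theorem 4.1, $\sup_t \|z_t\| < \infty$ a.s. Hence, by the linear growth condition on $h(\cdot)$ (see Assumption 3.3), there exists an $\{\mathcal{F}_t\}$-adapted process $\{U''_t\}$ such that $\|U'_t\| \le U''_t$ for all $t$ and $\sup_{t \ge 0} U''_t < \infty$ a.s. Then, defining $U_t$ to be

\[
U_t = U''_t \vee \left\| \left(I_{NM} - \tfrac{1}{N}\left(1_N 1_N^\top\right) \otimes I_M\right) K_t \right\|, \tag{68}
\]

we have by (67)-(68)

\[
\|U'_t\| + \|J'_t\| \le U_t \left(1 + J_t\right),
\]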

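with $\{U_t\}$ being $\{\mathcal{F}_t\}$-adapted and $\{J_t\}$ being the $\{\mathcal{F}_{t+1}\}$-adapted process $J_t = \|g(y_t) - h(\theta^*)\|$ for all $t$. Note that for every $\varepsilon > 0$ we have

\[
\mathbb{E}_{\theta^*}\left[ J_t^{2+\varepsilon} \right] < \infty, \tag{69}
\]

which follows from the fact that $g(y_t)$ possesses moments of all orders (see Proposition 2.1). Hence, by (66) we obtain

\[
\|\tilde{z}_{t+1}\| \le \left\| \left(I_{NM} - \beta_t L_t \otimes I_M\right) \tilde{z}_t \right\| + \alpha_t U_t \left(1 + J_t\right). \tag{70}
\]

Observe that, by construction, $\tilde{z}_t \in \mathcal{C}^\perp$ for all $t$, and hence, by Lemma A.2, there exists an $\{\mathcal{F}_{t+1}\}$-adapted $\mathbb{R}_+$-valued process $\{r_t\}$ (depending on $\{\tilde{z}_t\}$ and $\{L_t\}$), a deterministic time $t_r$ (large enough), and a constant $c_r > 0$, such that $0 \le r_t \le 1$ a.s. and

\[
\left\| \left(I_{NM} - \beta_t L_t \otimes I_M\right) \tilde{z}_t \right\| \le (1 - r_t)\, \|\tilde{z}_t\| \tag{71}
\]

with

\[
\mathbb{E}_{\theta^*}\left[ r_t \mid \mathcal{F}_t \right] \ge \frac{c_r}{(t+1)^{\tau_2}} \quad \text{a.s.}
\]

for all $t \ge t_r$. We then have by (70)-(71)

\[
\|\tilde{z}_{t+1}\| \le (1 - r_t)\, \|\tilde{z}_t\| + \alpha_t U_t \left(1 + J_t\right), \quad \forall t \ge t_r. \tag{72}
\]

Now consider arbitrary $\varepsilon > 0$ and note that, under the moment condition (69), the stochastic recursion in (72) falls under the purview of Lemma A.1 (by taking $\delta_1$ and $\delta_2$ in Lemma A.1 to be $\tau_2$ and $1$ respectively), and we conclude that $(t+1)^{\tau} \|\tilde{z}_t\| \to 0$ as $t \to \infty$ a.s. for each $\tau \in (0, 1 - \tau_2 - 1/(2+\varepsilon))$. Noting that

\[
\|z_n(t) - z_l(t)\| \le \|z_n(t) - z^a_t\| + \|z_l(t) - z^a_t\| \le 2\, \|\tilde{z}_t\|
\]

for each pair $n$ and $l$ of agents, we may further conclude that

\[
\mathbb{P}_{\theta^*}\left( \lim_{t \to \infty} (t+1)^{\tau} \|z_n(t) - z_l(t)\| = 0 \right) = 1 \tag{73}
\]

for all $\tau \in (0, 1 - \tau_2 - 1/(2+\varepsilon))$.

Since the above holds for arbitrary $\varepsilon > 0$, the desired assertion follows by making $\varepsilon$ tend to $\infty$. ■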
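The agreement behavior established above can be observed in a toy simulation of a consensus + innovations update. The sketch below is a simplified illustration under assumptions that are not from the paper: scalar parameter, unit-variance Gaussian observations at every agent (so that $g_n(y) = y$ and $h_n(z) = z$), a fixed ring communication graph instead of random Laplacians, a constant unit gain, and weights $\alpha_t = 1/(t+1)$, $\beta_t = b/(t+1)^{\tau_2}$. The function name and default values are hypothetical.

```python
import random


def simulate_consensus_innovations(T: int = 20000, n_agents: int = 5,
                                   theta_star: float = 1.5, b: float = 0.2,
                                   tau2: float = 0.3, seed: int = 1) -> list:
    """Run a scalar consensus + innovations recursion on a ring; return final estimates."""
    rng = random.Random(seed)
    z = [0.0] * n_agents
    for t in range(T):
        alpha = 1.0 / (t + 1)          # innovation weight
        beta = b / (t + 1) ** tau2     # consensus weight, decays more slowly than alpha
        y = [theta_star + rng.gauss(0.0, 1.0) for _ in range(n_agents)]
        new_z = []
        for n in range(n_agents):
            left = z[(n - 1) % n_agents]
            right = z[(n + 1) % n_agents]
            consensus = (z[n] - left) + (z[n] - right)  # disagreement with ring neighbors
            innovation = y[n] - z[n]                    # g_n(y_n(t)) - h_n(z_n(t))
            new_z.append(z[n] - beta * consensus + alpha * innovation)
        z = new_z
    return z
```

Because $\beta_t/\alpha_t \to \infty$, the spread between agent estimates becomes much smaller than the common estimation error, which is the qualitative content of the lemma.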
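Proof of Corollary 4.2: Since $(t+1)^{\tau_3} \|K_n(t) - \mathcal{K}\| \to 0$ a.s. as $t \to \infty$ for all $n$, by Egorov's theorem, for each $\varepsilon > 0$, there exist a deterministic $t_\varepsilon > 0$ and a positive constant $c_\varepsilon$, such that

\[
\mathbb{P}_{\theta^*}\left( \sup_{t \ge t_\varepsilon} (t+1)^{\tau_3} \|K_n(t) - \mathcal{K}\| > c_\varepsilon \right) < \varepsilon \tag{74}
\]

for all $n$. Now, for such an $\varepsilon > 0$, define, for each $n$, the following $\{\mathcal{F}_t\}$-adapted sequence $\{K^\varepsilon_n(t)\}$:

\[
K^\varepsilon_n(t) =
\begin{cases}
K_n(t) & \text{if } t < t_\varepsilon, \\
K_n(t) & \text{if } t \ge t_\varepsilon \text{ and } \|K_n(t) - \mathcal{K}\| \le c_\varepsilon (t+1)^{-\tau_3}, \\
\mathcal{K} & \text{otherwise.}
\end{cases} \tag{75}
\]

Note that, by the above construction, we have $\|K^\varepsilon_n(t) - \mathcal{K}\| \le c_\varepsilon (t+1)^{-\tau_3}$ for all $t \ge t_\varepsilon$; hence, choosing $\tau' \in (0, \tau_3)$ to be a constant (independent of $\varepsilon$), we have that

\[
(t+1)^{\tau'} \|K^\varepsilon_n(t) - \mathcal{K}\| \le c_\varepsilon (t+1)^{\tau' - \tau_3}, \quad \forall t \ge t_\varepsilon,
\]

for all $n$ and each $\varepsilon > 0$. Thus, clearly, for each $\varepsilon > 0$ and all $n$, the sequence $\{K^\varepsilon_n(t)\}$ converges a.s. to $\mathcal{K}$ uniformly (over sample paths) at rate $\tau' > 0$, i.e., for each $\delta > 0$, there exists (deterministic) $t_\delta > 0$ such that

\[
\mathbb{P}_{\theta^*}\left( \sup_{t \ge t_\delta} (t+1)^{\tau'} \|K^\varepsilon_n(t) - \mathcal{K}\| \le \delta \right) = 1.
\]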

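Now, for each $\varepsilon > 0$, let us define the $\{\mathcal{F}_t\}$-adapted sequences $\{z^\varepsilon_n(t)\}$, $n = 1, \cdots, N$, evolving as

\[
z^\varepsilon_n(t+1) = z^\varepsilon_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left( z^\varepsilon_n(t) - z^\varepsilon_l(t) \right) + \alpha_t K^\varepsilon_n(t) \left( g_n(y_n(t)) - h_n(z^\varepsilon_n(t)) \right).
\]

Noting that

\[
\sup_{n,t} \|z^\varepsilon_n(t) - z_n(t)\| = 0 \quad \text{on} \quad \left\{ \sup_{n,t} \|K^\varepsilon_n(t) - K_n(t)\| = 0 \right\},
\]

we have

\[
\mathbb{P}_{\theta^*}\left( \sup_{n,t} \|z^\varepsilon_n(t) - z_n(t)\| = 0 \right) \ge 1 - N\varepsilon \tag{76}
\]

by (74)-(75).

The uniform convergence of the gain sequences $\{K^\varepsilon_n(t)\}$ to $\mathcal{K}$, at rate $\tau' > 0$, ensures that, for each $\varepsilon > 0$, the processes $\{z^\varepsilon_n(t)\}$ satisfy the hypotheses of Theorem 4.2 and, hence, there exists a positive constant $\mu$ (that depends on $\tau'$ but not $\varepsilon$), such that $(t+1)^{\mu} \|z^\varepsilon_n(t) - \theta^*\| \to 0$ as $t \to \infty$ a.s. for each $n$. Hence, by (76) we have

\[
\mathbb{P}_{\theta^*}\left( \lim_{t \to \infty} (t+1)^{\mu} \|z_n(t) - \theta^*\| = 0 \right) \ge 1 - N\varepsilon \tag{77}
\]

for all $n$. Since $\varepsilon > 0$ is arbitrary and $\mu$ does not depend on $\varepsilon$, we may further conclude from (77) that $(t+1)^{\mu} \|z_n(t) - \theta^*\| \to 0$ as $t \to \infty$ a.s. for all $n$. ■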
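The truncation used to build the modified gain sequence in the proof above can be written as a small function. This is a minimal sketch under simplifying assumptions: gains are treated as scalars (so the norm is an absolute value), and the name `truncated_gain` and its argument names are hypothetical.

```python
def truncated_gain(k_n_t: float, k_limit: float, t: int,
                   t_eps: int, c_eps: float, rate: float) -> float:
    """Return the modified gain at time t.

    Before time t_eps the original gain is kept. From t_eps onward the
    original gain is kept only while it stays within c_eps * (t+1)^(-rate)
    of the limiting gain; otherwise the limiting gain is substituted.
    """
    if t < t_eps:
        return k_n_t
    if abs(k_n_t - k_limit) <= c_eps * (t + 1) ** (-rate):
        return k_n_t
    return k_limit
```

By construction the modified gain coincides with the original one on every sample path where the original gain never leaves the shrinking tube after `t_eps`, which is the event whose probability is controlled via Egorov's theorem.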

Appendix B
Proofs in Section 5

Proof of Lemma 5.2: The proof of Lemma 5.2 is accomplished in two steps: first, we show that the gain sequences reach consensus, and subsequently demonstrate that the limiting consensus value is indeed $N \cdot I^{-1}(\theta^*)$. To this end, consider the following:

Lemma B.1. Recall, for each $n$, the $\{\mathcal{F}_t\}$-adapted sequence $\{G_n(t)\}$ evolving as in (10), and denote by $\{G^a_t\}$ their instantaneous network averages, i.e., $G^a_t = (1/N) \sum_{n=1}^{N} G_n(t)$ for all $t$. Then, for each $n$ and $\tau \in [0, 1 - \tau_2)$, we have

\[
\mathbb{P}_{\theta^*}\left( \lim_{t \to \infty} (t+1)^{\tau} \|G_n(t) - G^a_t\| = 0 \right) = 1,
\]

where $\tau_2$ is the exponent associated with the weight sequence $\{\beta_t\}$, see Assumption 3.2.

Proof: We will show the desired convergence in the matrix Frobenius norm (denoted by $\|\cdot\|_F$ in the following), the convergence in the induced $\ell_2$ sense following immediately. Note that, by Lemma 5.1, $x_n(t) \to \theta^*$ as $t \to \infty$ a.s. for all $n$; hence, for each $n$, by the continuity of the local Fisher information matrix we have that

\[
I_n(x_n(t)) \to I_n(\theta^*) \quad \text{as } t \to \infty \text{ a.s.}
\]

Let $\tilde{G}_n(t) = G_n(t) - G^a_t$ denote the deviation at agent $n$ from the instantaneous network average $G^a_t$, and $I^a_t = (1/N) \sum_{n=1}^{N} I_n(x_n(t))$ the network average of the $I_n(x_n(t))$'s. Also, let $\tilde{G}_t$ and $\tilde{I}_t$ denote the matrices $\mathrm{Vec}(\tilde{G}_n(t))$ and $\mathrm{Vec}(\tilde{I}_n(t))$ respectively, where $\tilde{I}_n(t) = I_n(x_n(t)) - I^a_t$ for all $n$. Using the following readily verifiable properties of the Laplacian $L_t$,

\[
(1_N \otimes I_M)^\top (L_t \otimes I_M) = 0 \quad \text{and} \quad (L_t \otimes I_M)(1_N \otimes G^a_t) = 0, \tag{78}
\]

we have by (10)

\[
\tilde{G}_{t+1} = \left(I_{NM} - \beta_t (L_t \otimes I_M) - \alpha_t I_{NM}\right) \tilde{G}_t + \alpha_t \tilde{I}_t \tag{79}
\]

for all $t \ge 0$.

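Since for all $n$, $I_n(x_n(t)) \to I_n(\theta^*)$ as $t \to \infty$ a.s., the sequences $\{I_n(x_n(t))\}$ are bounded a.s. and, in particular, there exists an $\{\mathcal{F}_t\}$-adapted a.s. bounded process $\{U_t\}$ such that $\|\tilde{I}_t\|_F \le U_t$ for all $t$. For $m \in \{1, \cdots, M\}$, denote by $\tilde{G}_{m,t}$ the $m$-th column of $\tilde{G}_t$. Clearly, the process $\{\tilde{G}_{m,t}\}$ is $\{\mathcal{F}_t\}$-adapted and $\tilde{G}_{m,t} \in \mathcal{C}^\perp$ for all $t$. Hence, by Lemma A.2, there exist a $[0,1]$-valued $\{\mathcal{F}_{t+1}\}$-adapted process $\{r_{m,t}\}$ and a positive constant $c_{m,r}$ such that

\[
\left\| \left(I_{NM} - \beta_t L_t \otimes I_M\right) \tilde{G}_{m,t} \right\| \le (1 - r_{m,t})\, \|\tilde{G}_{m,t}\|
\]

and $\mathbb{E}_{\theta^*}\left[ r_{m,t} \mid \mathcal{F}_t \right] \ge c_{m,r}/(t+1)^{\tau_2}$ a.s. for all $t \ge t_0$ sufficiently large. Noting that the square of the Frobenius norm is the sum of the squared column $\ell_2$ norms, we have

\[
\left\| \left(I_{NM} - \beta_t L_t \otimes I_M\right) \tilde{G}_t \right\|_F^2 \le \sum_{m=1}^{M} (1 - r_{m,t})^2\, \|\tilde{G}_{m,t}\|^2 \le (1 - r_t)^2\, \|\tilde{G}_t\|_F^2, \tag{80}
\]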



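where $\{r_t\}$ is the $\{\mathcal{F}_{t+1}\}$-adapted process given by $r_t = r_{1,t} \wedge \cdots \wedge r_{M,t}$ for all $t$. By the conditional Jensen's inequality we obtain

\[
\mathbb{E}_{\theta^*}\left[ r_t \mid \mathcal{F}_t \right] \ge \frac{c_r}{(t+1)^{\tau_2}}
\]

for some constant $c_r > 0$ and all $t \ge t_0$. Since $\beta_t/\alpha_t \to \infty$ as $t \to \infty$, by making $t_0$ larger if necessary, we obtain from (80)

\[
\left\| \left(I_{NM} - \beta_t L_t \otimes I_M - \alpha_t I_{NM}\right) \tilde{G}_t \right\|_F \le \left\| \left(I_{NM} - \beta_t L_t \otimes I_M\right) \tilde{G}_t \right\|_F + \alpha_t \|\tilde{G}_t\|_F
\]

for all $t \ge t_0$. It then follows from (79) and (80) that

\[
\|\tilde{G}_{t+1}\|_F \le \left\| \left(I_{NM} - \beta_t L_t \otimes I_M - \alpha_t I_{NM}\right) \tilde{G}_t \right\|_F + \alpha_t U_t \le (1 - r_t/2)\, \|\tilde{G}_t\|_F + \alpha_t U_t \tag{81}
\]

for all $t \ge t_0$. Clearly, the above recursion falls under the purview of Lemma A.1 (by setting $\delta_1$, $\delta_2$ and $J_t$ in Lemma A.1 to $\tau_2$, $1$ and $0$ respectively), and we conclude that $(t+1)^{\tau} \|\tilde{G}_t\|_F \to 0$ as $t \to \infty$ a.s. for each $\tau \in [0, 1 - \tau_2)$. The assertion in Lemma B.1 follows immediately. ■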
We state another approximation result from [40] regarding deterministic recursions to be used in the sequel.

Proposition B.1 (Lemma 4.3 in [40]). Let $\{b_t\}$ be a scalar sequence satisfying

\[
b_{t+1} \le \left(1 - \frac{c}{t+1}\right) b_t + d_t (t+1)^{-\tau},
\]

where $c > \tau$, $\tau > 0$, and the sequence $\{d_t\}$ is summable. Then $\limsup_{t \to \infty} (t+1)^{\tau} b_t < \infty$.
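Proposition B.1 can be illustrated numerically in a simple instance. The sketch below is only a sanity check under assumed values, not a proof: the recursion is taken with equality, with $c = 1$, $\tau = 0.5$ and the summable sequence $d_t = (t+1)^{-1.2}$; the function name `run_recursion` is hypothetical.

```python
def run_recursion(T: int, c: float = 1.0, tau: float = 0.5, b0: float = 1.0) -> list:
    """Iterate b_{t+1} = (1 - c/(t+1)) b_t + d_t (t+1)^(-tau) with d_t = (t+1)^(-1.2).

    Returns the scaled sequence (t+1)^tau * b_t for t = 0, ..., T.
    """
    b = b0
    scaled = [b]  # (0 + 1)^tau * b_0
    for t in range(T):
        d_t = (t + 1) ** (-1.2)  # summable forcing sequence
        b = (1.0 - c / (t + 1)) * b + d_t * (t + 1) ** (-tau)
        scaled.append((t + 2) ** tau * b)  # b now holds b_{t+1}
    return scaled
```

With these values the scaled sequence stays bounded along the whole trajectory, consistent with the proposition's conclusion.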



We now complete the proof of Lemma 5.2.

Proof of Lemma 5.2: Following the notation in the proof of Lemma B.1 and using the properties of the graph Laplacian in (78), the process $\{G^a_t\}$ (the instantaneous network average of the $G_n(t)$'s) may be shown to satisfy the following recursion for all $t$:

\[
G^a_{t+1} = (1 - \alpha_t)\, G^a_t + \alpha_t I^a_t. \tag{82}
\]
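With $\alpha_t = (t+1)^{-1}$, a recursion of this averaging form simply produces the running mean of its inputs, regardless of the initial condition. The following minimal sketch checks this for scalar inputs (the matrix case is the same statement applied entrywise); the function name is hypothetical.

```python
def averaged_recursion(samples: list, g0: float = 0.0) -> list:
    """Iterate g_{t+1} = (1 - alpha_t) g_t + alpha_t x_t with alpha_t = 1/(t+1).

    Returns [g_1, g_2, ...]; g_{t+1} equals the mean of x_0, ..., x_t.
    """
    g = g0
    out = []
    for t, x in enumerate(samples):
        alpha = 1.0 / (t + 1)
        g = (1.0 - alpha) * g + alpha * x
        out.append(g)
    return out
```

Since $\alpha_0 = 1$, the initial value `g0` is overwritten at the first step, which is why it does not affect the result.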

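Noting that the local Fisher information matrices $I_n(\cdot)$ are locally Lipschitz in the argument and the fact that $x_n(t) \to \theta^*$ as $t \to \infty$ a.s. (see Lemma 5.1), we have that

\[
\left\| I^a_t - (1/N)\, I(\theta^*) \right\| = O\left( \sum_{n=1}^{N} \|x_n(t) - \theta^*\| \right). \tag{83}
\]

Since, by Lemma 5.1, $(t+1)^{\mu_0} \|x_n(t) - \theta^*\| \to 0$ as $t \to \infty$ a.s., we may further conclude from (83) that

\[
\left\| I^a_t - (1/N)\, I(\theta^*) \right\| = o\left( (t+1)^{-\mu_0} \right). \tag{84}
\]

Now let $\tau_5$ be a positive constant such that $\tau_5 < (\mu_0 \wedge 1)$. Noting that $\alpha_t = (t+1)^{-1}$ by definition, by (84) we may then conclude that there exists an $\mathbb{R}_+$-valued $\{\mathcal{F}_t\}$-adapted stochastic process $\{d_t\}$, such that

\[
\alpha_t \left\| I^a_t - (1/N)\, I(\theta^*) \right\| \le d_t (t+1)^{-\tau_5} \tag{85}
\]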

for all $t$, with $\{d_t\}$ satisfying

\[
d_t = o\left( (t+1)^{-1 - \mu_0 + \tau_5} \right). \tag{86}
\]

By (82) and (85) we then obtain

\[
\begin{aligned}
\left\| G^a_{t+1} - (1/N)\, I(\theta^*) \right\| &\le \left(1 - (t+1)^{-1}\right) \left\| G^a_t - (1/N)\, I(\theta^*) \right\| + \alpha_t \left\| I^a_t - (1/N)\, I(\theta^*) \right\| \\
&\le \left(1 - (t+1)^{-1}\right) \left\| G^a_t - (1/N)\, I(\theta^*) \right\| + d_t (t+1)^{-\tau_5}
\end{aligned} \tag{87}
\]

for all $t$. Further, by (86), we have $\sum_t d_t < \infty$ a.s. (since $\tau_5 < \mu_0$ by construction); also noting that $\tau_5 < 1$ (again by construction), a pathwise application of Proposition B.1 to the stochastic recursion (87) yields

\[
\limsup_{t \to \infty}\, (t+1)^{\tau_5} \left\| G^a_t - (1/N)\, I(\theta^*) \right\| < \infty \quad \text{a.s.},
\]

from which we may further conclude that $(t+1)^{\tau_6} \left\| G^a_t - (1/N)\, I(\theta^*) \right\| \to 0$ as $t \to \infty$ a.s., where $\tau_6$ is another positive constant such that $\tau_6 < \tau_5$.

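Now introducing another constant $\tau_7$ such that $0 < \tau_7 < (1 - \tau_2) \wedge \tau_6$, by Lemma B.1 it may be readily concluded that

\[
\mathbb{P}_{\theta^*}\left( \lim_{t \to \infty} (t+1)^{\tau_7} \left\| G_n(t) - (1/N)\, I(\theta^*) \right\| = 0 \right) = 1 \tag{88}
\]

for all $n$. Finally, noting that matrix inversion is a locally Lipschitz operator in a neighborhood of an invertible argument, we have by (9), (12), and (88) that

\[
\begin{aligned}
\left\| K_n(t) - N \cdot I^{-1}(\theta^*) \right\| &= \left\| \left(G_n(t) + \varphi_t I_M\right)^{-1} - N \cdot I^{-1}(\theta^*) \right\| \\
&= O\left( \left\| G_n(t) - (1/N)\, I(\theta^*) \right\| + \varphi_t \right) = O\left( (t+1)^{-\tau_7} + (t+1)^{-\mu_2} \right) = o\left( (t+1)^{-\tau'} \right),
\end{aligned}
\]

where $\tau'$ may be taken to be an arbitrary positive constant satisfying $\tau' < \tau_7 \wedge \mu_2$. Hence, the desired assertion follows. ■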
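The last step relies on the local Lipschitz continuity of matrix inversion: a small perturbation of an invertible matrix, plus a small regularization term, changes the inverse by an amount of the same order. The sketch below illustrates this for a $2 \times 2$ example. All numbers are assumptions made for illustration (four agents, a diagonal Fisher information matrix, and a perturbation and regularization both equal to `delta`), and the function names are hypothetical.

```python
import math


def inv2(a: list) -> list:
    """Inverse of a 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def regularized_gain(G: list, phi: float) -> list:
    """Return (G + phi * I)^{-1} for a 2x2 matrix G."""
    return inv2([[G[0][0] + phi, G[0][1]],
                 [G[1][0], G[1][1] + phi]])


def gain_error(delta: float) -> float:
    """Frobenius distance between the regularized gain and N * I^{-1}.

    G is (1/N) * I perturbed by delta in every entry, and phi = delta.
    """
    n_agents = 4
    fisher = [[4.0, 0.0], [0.0, 8.0]]
    fisher_inv = inv2(fisher)
    target = [[n_agents * fisher_inv[i][j] for j in range(2)] for i in range(2)]
    G = [[fisher[i][j] / n_agents + delta for j in range(2)] for i in range(2)]
    K = regularized_gain(G, delta)
    return math.sqrt(sum((K[i][j] - target[i][j]) ** 2
                         for i in range(2) for j in range(2)))
```

Evaluating `gain_error` at decreasing values of `delta` shows the error shrinking roughly in proportion to `delta`.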
Proof of Lemma 5.3: The following intermediate approximation will be used in the proof of Lemma 5.3.

Lemma B.2. For each $n$, let $\{x_n(t)\}$ and $\{v_n(t)\}$ be as in the hypothesis of Lemma 5.3 and denote by $\{u_n(t)\}$ the $\{\mathcal{F}_t\}$-adapted process such that $u_n(t) = x_n(t) - v_n(t)$ for all $t$. Then, for each $\gamma \in [0, 1 - \tau_2)$ (where $\tau_2$ is the exponent corresponding to $\{\beta_t\}$, see Assumption 3.2), we have

\[
\mathbb{P}_{\theta^*}\left( \lim_{t \to \infty} (t+1)^{\gamma} \|u_n(t) - u_l(t)\| = 0 \right) = 1
\]

for all pairs $(n, l)$ of network agents.

Proof: By the update rules of $\{x_n(t)\}$ and $\{v_n(t)\}$, the process $\{u_n(t)\}$ is readily seen to satisfy the recursions

\[
u_n(t+1) = u_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \left( u_n(t) - u_l(t) \right) - \alpha_t K_n(t)\, U'_n(t), \tag{89}
\]

where

\[
U'_n(t) = h_n(x_n(t)) - h_n(\theta^*) - I_n(\theta^*) \left( v_n(t) - \theta^* \right) \tag{90}
\]

for all $t$. Noting that the processes $\{x_n(t)\}$ and $\{v_n(t)\}$ converge a.s. as $t \to \infty$ (see Corollary 5.1 and Corollary 5.2), we conclude that the sequence $\{U'_n(t)\}$, thus defined, is bounded a.s. Denoting by $U'_t$ and $u_t$ the block-vectors $\mathrm{Vec}(U'_n(t))$ and $\mathrm{Vec}(u_n(t))$ respectively, from (89) we then have

\[
u_{t+1} = \left(I_{NM} - \beta_t L_t \otimes I_M\right) u_t - \alpha_t U_t,
\]

where $\{U_t\}$ is the $\{\mathcal{F}_t\}$-adapted process given by $U_t = \mathrm{Diag}(K_n(t))\, U'_t$ for all $t$. Noting that $\{U'_t\}$ is bounded a.s. and the adaptive gain sequence $\{K_n(t)\}$ converges a.s. as $t \to \infty$ for all $n$, the process $\{U_t\}$ is readily seen to be bounded a.s. Further, denoting by $\{\tilde{u}_t\}$ and $\{\tilde{U}_t\}$ the processes such that

\[
\tilde{u}_t = \left(I_{NM} - \tfrac{1}{N}\left(1_N 1_N^\top\right) \otimes I_M\right) u_t \quad \text{and} \quad \tilde{U}_t = \left(I_{NM} - \tfrac{1}{N}\left(1_N 1_N^\top\right) \otimes I_M\right) U_t
\]

for all $t$, we have (using standard properties of the Laplacian)

\[
\tilde{u}_{t+1} = \left(I_{NM} - \beta_t L_t \otimes I_M\right) \tilde{u}_t - \alpha_t \tilde{U}_t \tag{91}
\]

for all $t$. Clearly, $\tilde{u}_t \in \mathcal{C}^\perp$ for all $t$, and we may note that, at this point, the evolution (91) resembles the dynamics analyzed in Lemma 4.2 (for the process $\{\tilde{z}_t\}$, see (70)). Following essentially similar arguments as in (70)-(73), we have $(t+1)^{\gamma} \|\tilde{u}_t\| \to 0$ as $t \to \infty$ a.s. for all $\gamma \in [0, 1 - \tau_2)$, from which the desired assertion follows. ■
Proof of Lemma 5.3: In what follows we stick to the notation in the proof of Lemma B.2. By (89) we have that

\[
u^a_{t+1} = u^a_t - (1/N)\, \alpha_t \sum_{n=1}^{N} K_n(t)\, U'_n(t),
\]

where $u^a_t = (1/N) \sum_{n=1}^{N} u_n(t)$ for all $t$. Now note that, for each $n$, the function $h_n(\cdot)$ is twice continuously differentiable with gradient $I_n(\cdot)$ (see Proposition 2.1 and Proposition 2.2), and hence there exist positive constants $c$ and $R$, such that for each $n$,

\[
\left\| h_n(z) - h_n(\theta^*) - I_n(\theta^*)(z - \theta^*) \right\| \le c\, \|z - \theta^*\|^2 \tag{92}
\]

for all $z \in \mathbb{R}^M$ with $\|z - \theta^*\| \le R$. Since $x_n(t) \to \theta^*$ as $t \to \infty$ a.s. (see Corollary 5.1) for each $n$, there exists a finite random time $t_R$, such that

\[
\max_{n = 1, \cdots, N} \|x_n(t) - \theta^*\| \le R, \quad \forall t \ge t_R \quad \text{a.s.} \tag{93}
\]
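A second-order Taylor bound of the form (92) can be checked explicitly for a concrete family. The sketch below uses the scalar Poisson family in its natural parametrization as an assumed example (it is not singled out in the paper): there the log-partition function is $e^{\theta}$, so that $h(\theta) = e^{\theta}$ and the Fisher information is also $e^{\theta}$, and the Lagrange form of the remainder gives the constant $c = e^{\theta^* + R}/2$ on the ball of radius $R$. The function names are hypothetical.

```python
import math


def taylor_remainder(z: float, theta_star: float) -> float:
    """|h(z) - h(theta*) - I(theta*) (z - theta*)| for h = I = exp (Poisson family)."""
    return abs(math.exp(z) - math.exp(theta_star)
               - math.exp(theta_star) * (z - theta_star))


def remainder_constant(theta_star: float, radius: float) -> float:
    """Constant c with remainder <= c (z - theta*)^2 whenever |z - theta*| <= radius."""
    return 0.5 * math.exp(theta_star + radius)


def bound_holds(theta_star: float, radius: float, n_grid: int = 1000) -> bool:
    """Check the quadratic remainder bound on a grid over the ball of given radius."""
    c = remainder_constant(theta_star, radius)
    for i in range(n_grid + 1):
        z = theta_star - radius + 2.0 * radius * i / n_grid
        if taylor_remainder(z, theta_star) > c * (z - theta_star) ** 2 + 1e-12:
            return False
    return True
```

The constant grows with the radius, which is why the bound is only invoked after the random time at which all iterates have entered the ball.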
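Hence, by (90) and (92)-(93), we have that

\[
\begin{aligned}
U'_n(t) &= h_n(x_n(t)) - h_n(\theta^*) - I_n(\theta^*) \left( v_n(t) - \theta^* \right) = I_n(\theta^*) \left( x_n(t) - v_n(t) \right) + \mathcal{R}_n(t) \\
&= I_n(\theta^*)\, u_n(t) + \mathcal{R}_n(t)
\end{aligned}
\]

for all $n$ and $t$, where the residuals $\mathcal{R}_n(t)$, $n = 1, \cdots, N$, satisfy

\[
\|\mathcal{R}_n(t)\| \le c\, \|x_n(t) - \theta^*\|^2, \quad \forall t \ge t_R \quad \text{a.s.}
\]

Standard algebraic manipulations further yield

\[
\begin{aligned}
\|\mathcal{R}_n(t)\| &\le c\, \|x_n(t) - \theta^*\|^2 \le 2c\, \|x_n(t) - v_n(t)\|^2 + 2c\, \|v_n(t) - \theta^*\|^2 \\
&= 2c\, \|u_n(t)\|^2 + 2c\, \|v_n(t) - \theta^*\|^2 \le 4c\, \|u^a_t\|^2 + 4c\, \|u_n(t) - u^a_t\|^2 + 2c\, \|v_n(t) - \theta^*\|^2
\end{aligned} \tag{94}
\]

for all $t \ge t_R$ a.s.

Note that the fact that $(t+1)^{\tau} \|v_n(t) - \theta^*\| \to 0$ as $t \to \infty$ a.s. for all $n$ and $\tau \in [0, 1/2)$ implies that there exists a constant $\gamma_1 > 1/2$ such that

\[
\max_{n = 1, \cdots, N} \|v_n(t) - \theta^*\|^2 = o\left( (t+1)^{-\gamma_1} \right) \quad \text{a.s.} \tag{95}
\]

Also, by Lemma B.2 and the fact that $\tau_2 < 1/2$ (see Assumption 3.2), we have that

\[
\max_{n = 1, \cdots, N} \|u_n(t) - u^a_t\| = o\left( (t+1)^{-\gamma_2} \right) \quad \text{a.s.} \tag{96}
\]

for some constant $\gamma_2 > 1/2$.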

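By the previous construction, the recursions for $\{u^a_t\}$ may be written as

\[
u^a_{t+1} = \left(I_M - \alpha_t Q_t\right) u^a_t - \alpha_t \mathcal{R}_t, \tag{97}
\]

where

\[
Q_t = (1/N) \sum_{n=1}^{N} K_n(t)\, I_n(\theta^*)
\]

and

\[
\mathcal{R}_t = (1/N) \sum_{n=1}^{N} K_n(t) \left( I_n(\theta^*) \left( u_n(t) - u^a_t \right) + \mathcal{R}_n(t) \right) \tag{98}
\]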



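for all $t$. By (94) and (98) we obtain

\[
\begin{aligned}
\|\mathcal{R}_t\| &\le (1/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|u_n(t) - u^a_t\| \\
&\quad + (4c/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|u^a_t\|^2 + (4c/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|u_n(t) - u^a_t\|^2 \\
&\quad + (2c/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|v_n(t) - \theta^*\|^2
\end{aligned} \tag{99}
\]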

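for $t \ge t_R$ a.s. Then, denoting by $\{\lambda_t\}$ the $\{\mathcal{F}_t\}$-adapted process such that

\[
\lambda_t = (4c/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|u^a_t\|
\]

for all $t$, and observing that, by (95) and (96),

\[
(2c/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|v_n(t) - \theta^*\|^2 = o\left( (t+1)^{-\gamma_1} \right) \quad \text{a.s.},
\]

\[
(1/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|u_n(t) - u^a_t\| = o\left( (t+1)^{-\gamma_2} \right) \quad \text{a.s.},
\]

and

\[
(4c/N) \sum_{n=1}^{N} \|K_n(t)\, I_n(\theta^*)\|\, \|u_n(t) - u^a_t\|^2 = o\left( (t+1)^{-2\gamma_2} \right) \quad \text{a.s.}
\]

(note that the gain sequences $\{K_n(t)\}$ converge a.s., hence $K_n(t) = O(1)$ for all $n$), we obtain the following from (99):

\[
\|\mathcal{R}_t\| \le \lambda_t \|u^a_t\| + o\left( (t+1)^{-\gamma_3} \right) \tag{100}
\]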



for some constant $\gamma_3$ such that $1/2 < \gamma_3 < \gamma_1 \wedge \gamma_2$. Thus, by (97)-(98) and (100), and by making $t_R$ larger if necessary, we conclude that there exists a positive constant $b$ such that

\[
\|u^a_{t+1}\| \le \left\| I_M - \alpha_t Q_t + \alpha_t \lambda_t I_M \right\| \|u^a_t\| + b\, \alpha_t (t+1)^{-\gamma_3} \tag{101}
\]

for all $t \ge t_R$ a.s. Since $K_n(t) \to N \cdot I^{-1}(\theta^*)$ as $t \to \infty$ a.s. for each $n$ and $\sum_{n=1}^{N} I_n(\theta^*) = I(\theta^*)$, we have $Q_t \to I_M$ as $t \to \infty$ a.s.; similarly, since for all $n$ the sequences $\{x_n(t)\}$ and $\{v_n(t)\}$ converge to $\theta^*$ a.s. as $t \to \infty$ (see Corollary 5.1 and Corollary 5.2), it follows (from definition) that $u_n(t) \to 0$ as $t \to \infty$ a.s. for all $n$, and hence $\lambda_t \to 0$ as $t \to \infty$ a.s. The fact that $Q_t \to I_M$ and $\lambda_t \to 0$ as $t \to \infty$ a.s. ensures that, by making $t_R$ larger if necessary, the following holds:

\[
\left\| I_M - \alpha_t Q_t + \alpha_t \lambda_t I_M \right\| \le 1 - (2/3)\, \alpha_t = 1 - (2/3)\, (t+1)^{-1} \tag{102}
\]

for all $t \ge t_R$ a.s. Let $\gamma_4$ be a constant such that $1/2 < \gamma_4 < \gamma_3 \wedge (2/3)$; then, by (101)-(102), we have

\[
\|u^a_{t+1}\| \le \left(1 - (2/3)\, (t+1)^{-1}\right) \|u^a_t\| + d_t (t+1)^{-\gamma_4}
\]

for all $t \ge t_R$ a.s., where $d_t = b\, \alpha_t (t+1)^{\gamma_4 - \gamma_3}$. Since $\gamma_4 < 2/3$ and the sequence $\{d_t\}$ is summable, a pathwise application of Proposition B.1 yields

\[
\mathbb{P}_{\theta^*}\left( \limsup_{t \to \infty}\, (t+1)^{\gamma_4} \|u^a_t\| < \infty \right) = 1.
\]

Hence, by choosing $\tau$ such that $1/2 < \tau < \gamma_2 \wedge \gamma_4$ (where $\gamma_2$ is defined in (96)), we have that $(t+1)^{\tau} u_n(t) \to 0$ as $t \to \infty$ a.s. for all $n$, and the desired assertion follows. ■